Re: Question on Spark code

2017-07-23 Thread Reynold Xin
, it will return the same type with the level that you called it. >> >> On Sun, Jul 23, 2017 at 8:20 PM Reynold Xin <r...@databricks.com> wrote: >> >>> It means the same object ("this") is returned. >>> >>> On Sun,

Re: Question on Spark code

2017-07-23 Thread Reynold Xin
It means the same object ("this") is returned. On Sun, Jul 23, 2017 at 8:16 PM, tao zhan wrote: > Hello, > > I am new to scala and spark. > What does the "this.type" in set function for? > > > https://github.com/apache/spark/blob/481f0792944d9a77f0fe8b5e2596da >
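
A rough Python analogue of the reply above may help readers who do not know Scala. It is only a sketch: the class names `Params` and `Estimator` and the keys used are hypothetical, not Spark's code. The setter returns the very object it was called on, typed as the caller's own class, which is the guarantee `this.type` gives in Scala.

```python
from typing import TypeVar

# Bound type variable so a setter called on a subclass is typed as that subclass.
T = TypeVar("T", bound="Params")


class Params:
    """Hypothetical stand-in for a class with chainable setters."""

    def __init__(self) -> None:
        self.values = {}

    def set(self: T, key: str, value: object) -> T:
        # Return the same object ("this"), not a copy and not a plain Params.
        self.values[key] = value
        return self


class Estimator(Params):
    """Hypothetical subclass; chaining must keep this more specific type."""

    def fit(self) -> str:
        return "fitted with " + repr(sorted(self.values.items()))


est = Estimator()
same = est.set("maxIter", 10).set("regParam", 0.1)
print(same is est)   # the chained call hands back the original object
print(same.fit())    # subclass methods stay reachable after chaining
```

Because `set` hands back the caller itself, calls can be chained on a subclass without losing the subclass's own methods.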

Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Reynold Xin
+1 On Thu, Apr 27, 2017 at 11:59 AM Michael Armbrust wrote: > I'll also +1 > > On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen wrote: > >> +1 , same result as with the last RC. All checks out for me. >> >> On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust

Re: Thoughts on release cadence?

2017-07-30 Thread Reynold Xin
This is reasonable ... +1 On Sun, Jul 30, 2017 at 2:19 AM, Sean Owen wrote: > The project had traditionally posted some guidance about upcoming > releases. The last release cycle was about 6 months. What about penciling > in December 2017 for 2.3.0?

Re: Interested in contributing to spark eco

2017-07-28 Thread Reynold Xin
Shashi, Welcome! There are a lot of ways you can help contribute. There is a page documenting some of them: http://spark.apache.org/contributing.html On Fri, Jul 28, 2017 at 1:35 PM, Shashi Dongur wrote: > Hello All, > > I am looking for ways to contribute to Spark repo.

Re: Increase Timeout or optimize Spark UT?

2017-08-20 Thread Reynold Xin
It seems like it's time to look into how to cut down some of the test runtimes. Test runtimes will slowly go up given the way development happens. 3 hr is already a very long time for tests to run. On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun wrote: > Hi, All. > > > >

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Reynold Xin
is a serious overkill for most of > the test cases anyway. > > > Best, > Maciej > > > > On 21 August 2017 at 03:00, Dong Joon Hyun <dh...@hortonworks.com> wrote: > >> +1 for any efforts to recover Jenkins! >> >> >> >> Thank you f

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Reynold Xin
Yea I don't think it's a good idea to upload a doc and then call for a vote immediately. People need time to digest ... On Thu, Aug 17, 2017 at 6:22 AM, Wenchen Fan wrote: > Sorry let's remove the VOTE tag as I just wanna bring this up for > discussion. > > I'll restart

Re: spark.sql.codegen.comments not in SQLConf?

2017-05-10 Thread Reynold Xin
It's probably because it is annoying to propagate that using SQL conf. On Wed, May 10, 2017 at 3:38 AM Jacek Laskowski wrote: > Hi, > > It seems that spark.sql.codegen.comments property [1] didn't find its > place in SQLConf [2] that appears to be the place for all Spark >

Re: Question: why is Externalizable used?

2017-06-19 Thread Reynold Xin
I responded on the ticket. On Mon, Jun 19, 2017 at 2:36 AM, Sean Owen wrote: > Just wanted to call attention to this question, mostly because I'm curious: > https://github.com/apache/spark/pull/18343#issuecomment-309388668 > > Why is Externalizable (+ KryoSerializable) used

Re: An Update on Spark on Kubernetes [Jun 23]

2017-06-23 Thread Reynold Xin
Thanks, Anirudh. This is super helpful! On Fri, Jun 23, 2017 at 9:50 AM, Anirudh Ramanathan wrote: > *Project Description: *Kubernetes cluster manager integration that > enables native support for submitting Spark applications to a kubernetes > cluster. The submitted

[SPARK-21190] SPIP: Vectorized UDFs in Python

2017-06-23 Thread Reynold Xin
Welcome to the first real SPIP. SPIP: Vectorized UDFs for Python https://issues.apache.org/jira/browse/SPARK-21190 Background and Motivation: Python is one of the most popular programming languages among Spark users. Spark currently exposes a row-at-a-time interface for defining and
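
The contrast between a row-at-a-time interface and a vectorized one can be sketched in plain Python. This is an illustration under assumptions, not the API proposed in the SPIP: the helper names and the use of Python lists as batches are hypothetical, and the full proposal is in the JIRA ticket linked above.

```python
calls = {"row": 0, "batch": 0}


def plus_one_row(value):
    # Row-at-a-time UDF: invoked once per value.
    calls["row"] += 1
    return value + 1


def plus_one_batch(values):
    # Vectorized UDF: invoked once per batch of values.
    calls["batch"] += 1
    return [v + 1 for v in values]


def apply_row_at_a_time(udf, column):
    return [udf(v) for v in column]


def apply_vectorized(udf, column, batch_size):
    out = []
    for start in range(0, len(column), batch_size):
        out.extend(udf(column[start:start + batch_size]))
    return out


column = list(range(10))
by_row = apply_row_at_a_time(plus_one_row, column)
by_batch = apply_vectorized(plus_one_batch, column, batch_size=4)
print(by_row == by_batch, calls)
```

Both paths compute the same result; the vectorized path simply crosses the function-call boundary once per batch instead of once per row.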

Re: New metrics for WindowExec with number of partitions and frames?

2017-05-26 Thread Reynold Xin
That would be useful (number of partitions). On Fri, May 26, 2017 at 3:24 PM Jacek Laskowski wrote: > Hi, > > Currently WindowExec gives no metrics in the web UI's Details for Query > page. > > What do you think about adding the number of partitions and frames? > That could

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Reynold Xin
Zoltan, Thanks for raising this again, although I'm a bit confused since I've communicated with you a few times on JIRA and on private emails to explain that you have some misunderstanding of the timestamp type in Spark and some of your statements are wrong (e.g. the except text file part). Not

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-26 Thread Reynold Xin
Spark wont change my timestamps. > > Ofir Manor > > Co-Founder & CTO | Equalum > > Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io > > On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote: > >> Zoltan, >> >> Thanks for

Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Reynold Xin
Seems useful to do. Is there a way to do this so it doesn't break Python 2.x? On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz wrote: > Hi everyone, > > For the last few months I've been working on static type annotations for > PySpark. For those of you, who are

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-06-01 Thread Reynold Xin
Again (I've probably said this more than 10 times already in different threads), SPARK-18350 has no impact on whether the timestamp type is with timezone or without timezone. It simply allows a session specific timezone setting rather than having Spark always rely on the machine timezone. On Wed,
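
A minimal sketch of the point above, assuming a timestamp is held as a count of microseconds since the Unix epoch: the stored value never changes, and a per-session timezone only affects how it is rendered. The fixed UTC offsets and the helper names are hypothetical and chosen for illustration; this is not Spark code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(instant):
    # Microseconds since the epoch for a timezone-aware datetime.
    return (instant - EPOCH) // timedelta(microseconds=1)


def render(micros, session_tz):
    # Same stored instant, displayed in whatever timezone the session asks for.
    instant = EPOCH + timedelta(microseconds=micros)
    return instant.astimezone(session_tz).strftime("%Y-%m-%d %H:%M:%S")


stored = to_micros(datetime(2017, 6, 1, 0, 0, 0, tzinfo=timezone.utc))
session_a = timezone(timedelta(hours=-7))  # hypothetical session setting
session_b = timezone(timedelta(hours=1))   # hypothetical session setting

print(render(stored, session_a))
print(render(stored, session_b))
```

Two sessions with different settings see different wall-clock strings for the same stored value, without either depending on the machine's timezone.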

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-01 Thread Reynold Xin
does the standard dictate what the parsing behavior should be for >> timestamp (without time zone) when a time zone is present? >> >> * if it does and spark violates this standard is it worth trying to >> retain the *other* semantics of timestamp without time zone, even if we >>

Re: [Spark SQL] Nanoseconds in Timestamps are set as Microseconds

2017-06-02 Thread Reynold Xin
Seems like a bug we should fix? I agree some form of truncation makes more sense. On Thu, Jun 1, 2017 at 1:17 AM, Anton Okolnychyi wrote: > Hi all, > > I would like to ask what the community thinks regarding the way how Spark > handles nanoseconds in the Timestamp
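
One possible form of the truncation mentioned above, as a sketch only. The original report is cut off in this listing, so the exact input format is an assumption; the function name is hypothetical. It keeps the first six fractional-second digits (microseconds) and drops anything finer.

```python
def fraction_to_micros(fraction_digits):
    """Convert the digits after the decimal point of a seconds value to microseconds."""
    # Pad short fractions ("5" -> "500000") and drop digits past the sixth.
    return int(fraction_digits.ljust(6, "0")[:6])


print(fraction_to_micros("123456789"))
print(fraction_to_micros("5"))
```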

Re: the dependence length of RDD, can its size be greater than 1 pleaae?

2017-06-15 Thread Reynold Xin
A join? On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue about this. > Is it possible that its size is

Re: Custom Partitioning in Catalyst

2017-06-16 Thread Reynold Xin
Perhaps we should extend the data source API to support that. On Fri, Jun 16, 2017 at 11:37 AM, Russell Spitzer wrote: > I've been trying to work with making Catalyst Cassandra partitioning > aware. There seem to be two major blocks on this. > > The first is that

Re: Custom Partitioning in Catalyst

2017-06-16 Thread Reynold Xin
tioning as well? > > On Fri, Jun 16, 2017 at 11:58 AM Reynold Xin <r...@databricks.com> wrote: > >> Perhaps we should extend the data source API to support that. >> >> >> On Fri, Jun 16, 2017 at 11:37 AM, Russell Spitzer < >> russell.spit...@gmail.com> wro

Re: Hoping contribute code-Spark 2.1.1 Documentation

2017-05-02 Thread Reynold Xin
Liucht, Thanks for the interest. You are more than welcome to contribute a pull request to fix the issue, at https://github.com/apache/spark On Tue, May 2, 2017 at 7:44 PM, cht liu wrote: > Hello,The Spark organizational leader : > This is my first time to

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Reynold Xin
Thanks for the email. The process is to create a JIRA ticket and then post a design doc for discussion. You will of course need to update your code to work with the latest master branch, but you should wait on that until the community has a chance to comment on the design. Cheers. On Fri, May

Re: PR permission to kick Jenkins?

2017-05-05 Thread Reynold Xin
I suspect the list is getting too big for Jenkins to function well. It stopped working for me a while ago. On Fri, May 5, 2017 at 12:06 PM, Tom Graves wrote: > Does anyone know how to configure Jenkins to allow committers to tell it > to test prs? I used to have

Re: Total memory tracking: request for comments

2017-09-20 Thread Reynold Xin
Thanks. This is an important direction to explore and my apologies for the late reply. One thing that is really hard about this is that with different layers of abstractions, we often use other libraries that might allocate large amount of memory (e.g. snappy library, Parquet itself), which makes

Re: [discuss] Data Source V2 write path

2017-09-20 Thread Reynold Xin
On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan wrote: > Hi all, > > I want to have some discussion about Data Source V2 write path before > starting a voting. > > The Data Source V1 write path asks implementations to write a DataFrame > directly, which is painful: > 1.

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Reynold Xin
tioning? We already have this situation in the DataFrameWriter API, > where calling partitionBy and then insertInto throws an exception. I’d > like to keep that case out of this API by setting the expectation that > tables this writes to already exist. > > rb > > > On Wed, S

Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Reynold Xin
Does anybody know whether this is a hard blocker? If it is not, we should probably push 2.1.2 forward quickly and do the infrastructure improvement in parallel. On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau wrote: > I'm more than willing to help migrate the scripts as part

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-05 Thread Reynold Xin
+1 On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau wrote: > Please vote on releasing the following candidate as Apache Spark version 2 > .1.2. The vote is open until Saturday October 7th at 9:00 PST and passes > if a majority of at least 3 +1 PMC votes are cast. > > [ ] +1

Re: SparkR is now available on CRAN

2017-10-12 Thread Reynold Xin
This is huge! On Thu, Oct 12, 2017 at 11:21 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Hi all > > I'm happy to announce that the most recent release of Spark, 2.1.2 is now > available for download as an R package from CRAN at >

Re: 2.1.2 maintenance release?

2017-09-08 Thread Reynold Xin
+1 as well. We should make a few maintenance releases. On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung wrote: > +1 on both 2.1.2 and 2.2.1 > > And would try to help and/or wrangle the release if needed. > > (Note: trying to backport a few changes to branch-2.1 right now)

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
>> Essentially, some APIs mix DDL and DML operations. I’d like to >> consider ways to fix that problem instead of carrying the problem forward >> to Data Source V2. We can solve this by adding a high-level API for DDL and >> a better write/insert API that works well with it. Clea

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
that works well with it. Clearly, that discussion > is independent of the read path, which is why I think separating the two > proposals would be a win. > > rb > > > On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <r...@databricks.com> wrote: > >> That might be good t

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Reynold Xin
+1 as well On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust wrote: > +1 > > On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue > wrote: > >> +1 (non-binding) >> >> Thanks for making the updates reflected in the current PR. It would be >> great to see

Re: [discuss] Data Source V2 write path

2017-09-24 Thread Reynold Xin
Can there be an explicit create function? On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan wrote: > I agree it would be a clean approach if data source is only responsible to > write into an already-configured table. However, without catalog > federation, Spark doesn't have an

Re: Should Flume integration be behind a profile?

2017-10-01 Thread Reynold Xin
Probably should do 1, and then it is an easier transition in 3.0. On Sun, Oct 1, 2017 at 1:28 AM Sean Owen wrote: > I tried and failed to do this in > https://issues.apache.org/jira/browse/SPARK-22142 because it became clear > that the Flume examples would have to be removed

Re: Configuration docs pages are broken

2017-10-03 Thread Reynold Xin
Interested in submitting a patch to fix them? On Tue, Oct 3, 2017 at 9:53 AM Nick Dimiduk wrote: > Heya, > > Looks like the Configuration sections of your docs, both latest [0], and > 2.1 [1] are broken. The last couple sections are smashed into a single > unrendered

Re: [Spark Core] Custom Catalog. Integration between Apache Ignite and Apache Spark

2017-09-25 Thread Reynold Xin
It's probably just an indication of lack of interest (or at least there isn't a substantial overlap between Ignite users and Spark users). A new catalog implementation is also pretty fundamental to Spark and the bar for that would be pretty high. See my comment in SPARK-17767. Guys - while I

Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Reynold Xin
+1 On Tue, Sep 26, 2017 at 9:47 PM, Holden Karau wrote: > Please vote on releasing the following candidate as Apache Spark version 2 > .1.2. The vote is open until Wednesday October 4th at 23:59 PST and > passes if a majority of at least 3 +1 PMC votes are cast. > > [ ]

Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-28 Thread Reynold Xin
The process for doing that was down before, and might've come back up and is now going through the huge backlog. On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen wrote: > Like whatever reassigns JIRAs after a PR is closed? > > It seems to be going crazy, or maybe there are many

Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Reynold Xin
+1 on adding Kubernetes support in Spark (as a separate module similar to how YARN is done) I talk with a lot of developers and teams that operate cloud services, and k8s in the last year has definitely become one of the key projects, if not the one with the strongest momentum in this space. I'm

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread Reynold Xin
James, Thanks for the comment. I think you just pointed out a trade-off between expressiveness and API simplicity, compatibility and evolvability. For the max expressiveness, we'd want the ability to expose full query plans, and let the data source decide which part of the query plan can be

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
RA and we will > be adding those later separately. > > Thanks. > > > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote: > >> Is the idea aggregate is out of scope for the current effort and we will >> be adding those later? >> >

Re: SPIP: Spark on Kubernetes

2017-09-01 Thread Reynold Xin
m all cluster managers to make progress on it. > A proposal regarding this will be in SPARK-19700 > <https://issues.apache.org/jira/browse/SPARK-19700>. > > This vote has passed. > So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no > -1 votes. > > Thanks all!

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
private List currentFilters = emptyList(); >>>> private Supplier barSupplier = newSupplier(currentFilters); >>>> >>>> public CachingFoo(Foo delegate) { >>>> this.delegate = delegate; >>>> } >>>> >>>

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-01 Thread Reynold Xin
re >> straightforward - the state management is painful atm). >> >> James >> >> On Wed, 30 Aug 2017 at 14:56 Reynold Xin <r...@databricks.com> wrote: >> >>> Sure that's good to do (and as discussed earlier a good compromise might >>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Is the idea that aggregates are out of scope for the current effort and we will be adding those later? On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN wrote: > Hi all, > > We've been discussing to support vectorized UDFs in Python and we almost > got a consensus about the APIs, so

Re: SPIP: Spark on Kubernetes

2017-08-30 Thread Reynold Xin
This has passed, hasn't it? On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan wrote: > Spark on Kubernetes effort has been developed separately in a fork, and > linked back from the Apache Spark project as an experimental backend >

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Reynold Xin
+1 One thing with MetadataSupport - It's a bad idea to call it that unless adding new functions in that trait wouldn't break source/binary compatibility in the future. On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan wrote: > I'm adding my own +1 (binding). > > On Tue, Oct 10,

Re: Interested to Contribute in Spark Development

2017-10-04 Thread Reynold Xin
Kumar, This is a good start: http://spark.apache.org/contributing.html On Wed, Oct 4, 2017 at 10:00 AM, vaquar khan wrote: > Hi Nishant, > > 1) Start with helping spark users on mailing list and stack . > > 2) Start helping build and testing. > > 3) Once comfortable

[discuss] SPIP: Continuous Processing Mode for Structured Streaming

2017-10-23 Thread Reynold Xin
Please take a look at the attached PDF for the SPIP: Continuous Processing Mode for Structured Streaming https://issues.apache.org/jira/browse/SPARK-20928 It is meant to be a very small, surgical change to Structured Streaming to enable ultra-low latency. This is great timing because we are also

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Reynold Xin
Most of those thoughts from Wenchen make sense to me. Rather than a list, can we create a table? X-axis is data type, and Y-axis is also data type, and the intersection explains what the coerced type is? Can we also look at what Hive, standard SQL (Postgres?) do? Also, this shouldn't be
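
The table suggested above could be generated from a symmetric lookup, sketched here. The type names and every rule in it are placeholders for illustration only and are NOT Spark's actual partition-column inference rules; `coerce` and `render_table` are hypothetical helpers.

```python
TYPES = ["int", "double", "date", "string"]

# Illustrative entries only; unlisted pairs fall back to "string" in this sketch.
PAIR_RULES = {
    frozenset({"int", "double"}): "double",
}


def coerce(a, b):
    # Symmetric lookup: coerce(a, b) == coerce(b, a).
    if a == b:
        return a
    return PAIR_RULES.get(frozenset({a, b}), "string")


def render_table():
    # X-axis and Y-axis are data types; each cell is the coerced type.
    width = max(len(t) for t in TYPES) + 2
    lines = ["".ljust(width) + "".join(t.ljust(width) for t in TYPES)]
    for row in TYPES:
        cells = "".join(coerce(row, col).ljust(width) for col in TYPES)
        lines.append(row.ljust(width) + cells)
    return "\n".join(line.rstrip() for line in lines)


print(render_table())
```

Keying the rules on unordered pairs keeps the table symmetric by construction, so only one half of it has to be written down and reviewed.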

Re: OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Reynold Xin
Is this due to the insert command not having metrics? It's a problem we should fix. On Mon, Nov 27, 2017 at 10:45 AM, Jason White wrote: > I'd like to use the SparkListenerInterface to listen for some metrics for > monitoring/logging/metadata purposes. The first ones

Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-01 Thread Reynold Xin
Congrats. On Fri, Dec 1, 2017 at 12:10 AM, Felix Cheung wrote: > This vote passes. Thanks everyone for testing this release. > > > +1: > > Sean Owen (binding) > > Herman van Hövell tot Westerflier (binding) > > Wenchen Fan (binding) > > Shivaram Venkataraman (binding) >

Re: Decimals

2017-12-13 Thread Reynold Xin
Responses inline On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido wrote: > Hi all, > > I saw in these weeks that there are a lot of problems related to decimal > values (SPARK-22036, SPARK-22755, for instance). Some are related to > historical choices, which I don't know,

Re: [01/51] [partial] spark-website git commit: 2.2.1 generated doc

2017-12-17 Thread Reynold Xin
There is an additional step that's needed to update the symlink, and that step hasn't been done yet. On Sun, Dec 17, 2017 at 12:32 PM, Jacek Laskowski wrote: > Hi Sean, > > What does "Not all the pieces are released yet" mean if you don't mind me > asking? 2.2.1 has already

Re: Faster and Lower memory implementation toPandas

2017-11-16 Thread Reynold Xin
Please send a PR. Thanks for looking at this. On Thu, Nov 16, 2017 at 7:27 AM Andrew Andrade wrote: > Hello devs, > > I know a lot of great work has been done recently with pandas to spark > dataframes and vice versa using Apache Arrow, but I faced a specific pain >

Re: how to replace hdfs with a custom distributed fs ?

2017-11-11 Thread Reynold Xin
You can implement the Hadoop FileSystem API for your distributed java fs and just plug into Spark using the Hadoop API. On Sat, Nov 11, 2017 at 9:37 AM, Cristian Lorenzetto < cristian.lorenze...@gmail.com> wrote: > hi i have my distributed java fs and i would like to implement my class > for

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Reynold Xin
Why tie a maintenance release to a feature release? They are supposed to be independent and we should be able to make a lot of maintenance releases as needed. On Thu, Nov 2, 2017 at 7:13 PM Sean Owen wrote: > The feature freeze is "mid November" : >

Re: [SS] Custom Sinks

2017-11-01 Thread Reynold Xin
They will probably both change, but I wouldn't block on the change if you have an immediate need. On Wed, Nov 1, 2017 at 10:41 AM, Anton Okolnychyi < anton.okolnyc...@gmail.com> wrote: > Hi all, > > I have a question about the future of custom data sinks in Structured > Streaming. In

Re: Jenkins upgrade/Test Parallelization & Containerization

2017-11-07 Thread Reynold Xin
My understanding is that AMP actually can provide more resources or adapt to changes, while ASF needs to manage 200+ projects and it's hard to accommodate much. I could be wrong though. On Tue, Nov 7, 2017 at 2:14 PM, Holden Karau wrote: > True, I think we've seen that the

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Reynold Xin
The vote has passed with the following +1s: Reynold Xin* Debasish Das Noman Khan Wenchen Fan* Matei Zaharia* Weichen Xu Vaquar Khan Burak Yavuz Xiao Li Tom Graves* Michael Armbrust* Joseph Bradley* Shixiong Zhu* And the following +0s: Sean Owen* Thanks for the feedback! On Wed, Nov 1, 2017

[Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Reynold Xin
Earlier I sent out a discussion thread for CP in Structured Streaming: https://issues.apache.org/jira/browse/SPARK-20928 It is meant to be a very small, surgical change to Structured Streaming to enable ultra-low latency. This is great timing because we are also designing and implementing data

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Reynold Xin
I just replied. On Wed, Nov 1, 2017 at 5:50 PM, Cody Koeninger <c...@koeninger.org> wrote: > Was there any answer to my question around the effect of changes to > the sink api regarding access to underlying offsets? > > On Wed, Nov 1, 2017 at 11:32 AM, Reynold Xin <r...@

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Reynold Xin
esday, November 1, 2017, 11:29:21 AM CDT, Debasish Das < > debasish.da...@gmail.com> wrote: > > > +1 > > Is there any design doc related to API/internal changes ? Will CP be the > default in structured streaming or it's a mode in conjunction with > existing behavior. >

Re: Dataset API Question

2017-10-25 Thread Reynold Xin
It is a bit more than syntactic sugar, but not much more: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533 BTW this is basically writing all the data out, and then create a new Dataset to load them in. On Wed, Oct 25, 2017 at 6:51 AM,

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
nd do the > same thing. > > rb > > > On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote: > >> s/underestimated/overestimated/ >> >> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote: >> >>> Marco,

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
SQL engine must be > UnsafeRow? > > Personally, I think it makes sense to say that everything should accept > InternalRow, but produce UnsafeRow, with the understanding that UnsafeRow > will usually perform better. > > rb > > > On Tue, May 8, 2018 at 4:09 PM, Reyn

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
s/underestimated/overestimated/ On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote: > Marco, > > There is understanding how Spark works, and there is finding bugs early in > their own program. One can perfectly understand how Spark works and still > fi

eager execution and debuggability

2018-05-08 Thread Reynold Xin
Similar to the thread yesterday about improving ML/DL integration, I'm sending another email on what I've learned recently from Spark users. I recently talked to some educators that have been teaching Spark in their (top-tier) university classes. They are some of the most important users for

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
>>> explaining there is an error in the .write operation and they are debugging >>> the writing to disk, focusing on that piece of code :) >>> >>> unrelated, but another frequent cause for confusion is cascading errors. >>> like the FetchFailedException. they will be debug

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
What the internal operators do is strictly internal. To take one step back, is the goal to design an API so the consumers of the API can directly produce what Spark expects internally, to cut down perf cost? On Tue, May 8, 2018 at 1:22 PM Ryan Blue wrote: > While

parser error?

2018-05-13 Thread Reynold Xin
Just saw this in one of my PRs that's doc only: [error] warning(154): SqlBase.g4:400:0: rule fromClause contains an optional block with at least one alternative that can match an empty string

Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
Hi all, Xiangrui and I were discussing with a heavy Apache Spark user last week on their experiences integrating machine learning (and deep learning) frameworks with Spark and some of their pain points. Couple things were obvious and I wanted to share our learnings with the list. (1) Most

Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
zing the offline discussion! I added a few > comments inline. -Xiangrui > > On Mon, May 7, 2018 at 5:37 PM Reynold Xin <r...@databricks.com> wrote: > >> Hi all, >> >> Xiangrui and I were discussing with a heavy Apache Spark user last week >> on their experience

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
with several > debug actions. And this would benefit new and experienced users alike. > > Nick > > On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid > <http://mailto:rb...@netflix.com.invalid> wrote: > > I've opened SPARK-24215 to track this. >> >>

Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Reynold Xin
Would be great to document. Probably best with examples. On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas wrote: > The documentation for DataFrame.join() > > lists all the

Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
I think that's what Xiangrui was referring to. Instead of retrying a single task, retry the entire stage, and the entire stage of tasks need to be scheduled all at once. On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > >> >>>- Fault tolerance and

Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
; i.e. > > if some frameworks like xgboost/mxnet requires 50 parallel workers, Spark > is desired to provide a capability to ensure that either we run 50 tasks at > once, or we should quit the complete application/job after some timeout > period > > Best, > > Nan > >
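
The all-or-nothing behaviour described in the quoted message can be sketched as a small simulation. This is not Spark's scheduler: the function name, the tick-based clock, and the list of free-slot counts are hypothetical stand-ins for illustration.

```python
def try_gang_schedule(required, free_slots_over_time, timeout_ticks):
    """Return the tick at which all tasks can launch together, or None on timeout."""
    for tick, free in enumerate(free_slots_over_time):
        if tick >= timeout_ticks:
            break  # waited long enough: give up on the whole job
        if free >= required:
            return tick  # launch every task at once, never a partial set
    return None


print(try_gang_schedule(50, [10, 30, 50, 60], timeout_ticks=10))
print(try_gang_schedule(50, [10, 30, 40], timeout_ticks=10))
```

The key design point is that a partial allocation (say 40 of 50 slots) is never used: the job either starts with everything it asked for or does not start.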

Re: Running lint-java during PR builds?

2018-05-21 Thread Reynold Xin
Can we look into if there is a plugin for sbt that works and then we can put everything into one single builder? On Mon, May 21, 2018 at 11:17 AM Dongjoon Hyun wrote: > Thank you for reconsidering this, Hyukjin. :) > > Bests, > Dongjoon. > > > On Mon, May 21, 2018 at

Re: Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Reynold Xin
But from the user's perspective, optimization is not run right? So it is still lazy. On Fri, Jun 8, 2018 at 12:35 PM Li Jin wrote: > Hi All, > > Sorry for the long email title. I am a bit surprised to find that the > current optimizer rule "ConvertToLocalRelation" causes expressions to be >

Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Reynold Xin
The behavior change is not good... On Thu, Jun 14, 2018 at 9:05 AM Li Jin wrote: > Ah, looks like it's this change: > > https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5 > > It seems strange that by default Spark doesn't build

Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Reynold Xin
+1 on the proposal. On Fri, Jun 1, 2018 at 8:17 PM Hossein wrote: > Hi Shivaram, > > We converged on a CRAN release process that seems identical to current > SparkR. > > --Hossein > > On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman < > shiva...@eecs.berkeley.edu> wrote: > >> Hossein --

Re: time for Apache Spark 3.0?

2018-06-15 Thread Reynold Xin
project. It > should be better to drift away the historical burden and focus in new area. > Spark has been widely used all over the world as a successful big data > framework. And it can be better than that. > >> > >> Andy > >> > >> > >> On Thu, Apr

Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?

2018-05-30 Thread Reynold Xin
SQL expressions? On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski wrote: > Hi, > > I've been exploring RuntimeReplaceable expressions [1] and have been > wondering what their purpose is. > > Quoting the scaladoc [2]: > > > An expression that gets replaced at runtime (currently by the optimizer)

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Reynold Xin
+1 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting an extended voting period. The

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Reynold Xin
Yes everybody please cc the release manager on changes that merit -1. It's high overhead and let's make this smoother. On Fri, Jun 1, 2018 at 1:28 PM Marcelo Vanzin wrote: > Xiao, > > This is the third time in this release cycle that this is happening. > Sorry to single out you guys, but can

Re: What about additional support on deeply nested data?

2018-06-20 Thread Reynold Xin
Seems like you are also looking for transform and reduce for arrays? https://issues.apache.org/jira/browse/SPARK-23908 https://issues.apache.org/jira/browse/SPARK-23911 On Wed, Jun 20, 2018 at 10:43 AM bobotu wrote: > I store some trajectories data in parquet with this schema: > > create
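
What "transform and reduce for arrays" means can be sketched in plain Python. This is not Spark SQL syntax and not the API tracked in the linked tickets; the trajectory fields `x` and `y` and the helper names are hypothetical, since the schema in the quoted message is cut off here.

```python
import math
from functools import reduce

# Hypothetical trajectory: an array of points nested inside one record.
points = [{"x": 0.0, "y": 0.0}, {"x": 3.0, "y": 4.0}, {"x": 6.0, "y": 8.0}]


def transform(array, fn):
    # Apply fn to every element, producing a new array of the same length.
    return [fn(e) for e in array]


def reduce_array(array, initial, merge):
    # Fold the array into a single value, starting from `initial`.
    return reduce(merge, array, initial)


xs = transform(points, lambda p: p["x"])
segments = transform(
    list(zip(points, points[1:])),
    lambda pair: math.hypot(pair[1]["x"] - pair[0]["x"], pair[1]["y"] - pair[0]["y"]),
)
length = reduce_array(segments, 0.0, lambda acc, d: acc + d)
print(xs, length)
```

Both operations work on the nested array in place, so the record never has to be exploded into one row per point and regrouped afterwards.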

Re: Beam's recent community development work

2018-07-02 Thread Reynold Xin
That's fair, and it's great to find high quality contributors. But I also feel the two projects have very different background and maturity phase. There are 1300+ contributors to Spark, and only 300 to Beam, with the vast majority of contributions coming from a single company for Beam (based on my

Re: Feature request: Java-specific transform method in Dataset

2018-07-01 Thread Reynold Xin
This wouldn’t be a problem with Scala 2.12 right? On Sun, Jul 1, 2018 at 12:23 PM Sean Owen wrote: > I see, transform() doesn't have the same overload that other methods do in > order to support Java 8 lambdas as you'd expect. One option is to introduce > something like MapFunction for

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-03 Thread Reynold Xin
Why do you need the underlying RDDs? Can't you just unpersist the dataframes that you don't need? On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas wrote: > This seems to be an underexposed part of the API. My use case is this: I > want to unpersist all DataFrames

Re: Anyone knows how to build and spark on jdk9?

2017-10-26 Thread Reynold Xin
It probably depends on the Scala version we use in Spark supporting Java 9 first. On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun wrote: > Hi all: > > 1. I want to build spark on jdk9 and test it with Hadoop on jdk9 > env. I search for jiras related to JDK9. I only

Re: Spark-XML maintenance

2017-10-26 Thread Reynold Xin
Adding Hyukjin who has been maintaining it. The easiest is probably to leave comments in the repo. On Thu, Oct 26, 2017 at 9:44 AM Jörn Franke wrote: > I would address databricks with this issue - it is their repository > > > On 26. Oct 2017, at 18:43, comtef

Re: Result obtained before the completion of Stages

2017-12-27 Thread Reynold Xin
Is it possible there is a bug in the UI? If you can run jstack on the executor process to see whether anything is actually running, that can help narrow down the issue. On Tue, Dec 26, 2017 at 10:28 PM ckhari4u wrote: > Hi Reynold, > > I am running a Spark SQL query. > >

Re: Integration testing and Scheduler Backends

2018-01-09 Thread Reynold Xin
If we can actually get our acts together and have integration tests in Jenkins (perhaps not run on every commit but can be run weekly or pre-release smoke tests), that'd be great. Then it relies less on contributors manually testing. On Tue, Jan 9, 2018 at 8:09 AM, Timothy Chen

Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Reynold Xin
I don’t think Spark relies on Kryo or Java for persistence. User programs might though so it would be great if we can shade it. On Fri, Jan 19, 2018 at 5:55 AM Sean Owen wrote: > See: > > https://issues.apache.org/jira/browse/SPARK-23131 >

Re: Spark 3

2018-01-19 Thread Reynold Xin
We can certainly provide a build for Scala 2.12, even in 2.x. On Fri, Jan 19, 2018 at 10:17 AM, Justin Miller < justin.mil...@protectwise.com> wrote: > Would that mean supporting both 2.12 and 2.11? Could be a while before > some of our libraries are off of 2.11. > > Thanks, > Justin > > > On

Re: ***UNCHECKED*** [jira] [Resolved] (SPARK-23218) simplify ColumnVector.getArray

2018-01-26 Thread Reynold Xin
I have no idea. Some JIRA update? Might want to file an INFRA ticket. On Fri, Jan 26, 2018 at 10:04 AM, Sean Owen wrote: > This is an example of the "*** UNCHECKED ***" message I was talking about > -- it's part of the email subject rather than JIRA. > > --

Re: What is "*** UNCHECKED ***"?

2018-01-26 Thread Reynold Xin
Examples? On Fri, Jan 26, 2018 at 9:56 AM, Sean Owen wrote: > I probably missed this, but what is the new "*** UNCHECKED ***" message in > the subject line of some JIRAs? >
