data source api v2 refactoring

2018-08-31 Thread Reynold Xin
I spent some time last week looking at the current data source v2 apis, and I thought we should be a bit more buttoned up in terms of the abstractions and the guarantees Spark provides. In particular, I feel we need the following levels of "abstractions", to fit the use cases in Spark, from batch,

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Reynold Xin
Let's see how they go. At some point we do need to cut the release. That argument can be made on every feature, and different people place different value / importance on different features, so we could just end up never making a release. On Thu, Aug 30, 2018 at 1:56 PM antonkulaga wrote: >

Re: python tests: any reason for a huge tests.py?

2018-08-24 Thread Reynold Xin
We should break it. On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid wrote: > Hi, > > another question from looking more at python recently. Is there any > reason we've got a ton of tests in one humongous tests.py file, rather than > breaking it out into smaller files? > > Having one huge file

Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide

2018-08-23 Thread Reynold Xin
I wrote both the Spark one and later the Databricks one. The latter had a lot more work put into it and is consistent with the Spark style. I'd just use the second one and link to it, if possible. On Thu, Aug 23, 2018 at 6:38 PM Hyukjin Kwon wrote: > If you meant "Code Style Guide", many of

Re: Spark DataFrame UNPIVOT feature

2018-08-21 Thread Reynold Xin
Probably just because it is not used that often and nobody has submitted a patch for it. I've used pivot probably on average once a week (primarily in spreadsheets), but I've never used unpivot ... On Tue, Aug 21, 2018 at 3:06 PM Ivan Gozali wrote: > Hi there, > > I was looking into why the

Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Reynold Xin
g 15, 2018 at 2:45 PM Reynold Xin wrote: > What's the reason we don't want to do the OS updates right now? Is it due > to the unpredictability of potential issues that might happen and end up > delaying 2.4 release? > > > On Wed, Aug 15, 2018 at 2:33 PM Erik Erlandson > wr

Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Reynold Xin
What's the reason we don't want to do the OS updates right now? Is it due to the unpredictability of potential issues that might happen and end up delaying 2.4 release? On Wed, Aug 15, 2018 at 2:33 PM Erik Erlandson wrote: > The SparkR support PR is finished, along with integration testing,

Re: Naming policy for packages

2018-08-15 Thread Reynold Xin
craps? :( On Wed, Aug 15, 2018 at 11:47 AM Koert Kuipers wrote: > ok it doesnt sound so bad if the maven identifier can have spark it in. no > big deal! > > otherwise i was going to suggest "kraps". like kraps-xml > > scala> "spark".reverse > res0: String = kraps > > > On Wed, Aug 15, 2018 at

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-15 Thread Reynold Xin
https://github.com/apache/spark/pull/21306>, I just left it out >> entirely. What if I just removed it from the proposal and we can add it >> later? >> ​ >> >> On Thu, Jul 26, 2018 at 4:32 PM Reynold Xin wrote: >> >>> Seems reasonable at high level. I

Re: Naming policy for packages

2018-08-15 Thread Reynold Xin
Unfortunately that’s an Apache foundation policy and the Spark community has no power to change it. My understanding: The reason Spark can’t be in the name is because if it is used frequently enough, the foundation would lose the Spark trademark. Cheers. On Wed, Aug 15, 2018 at 7:19 AM Simon

Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Reynold Xin
SGTM On Fri, Aug 10, 2018 at 1:39 PM shane knapp wrote: > https://issues.apache.org/jira/browse/SPARK-25089 > > basically since these branches are old, and there will be a greater than > zero amount of work to get lint-r to pass (on the new ubuntu workers), sean > and i are proposing to remove

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Reynold Xin
I actually totally agree that we should make sure it should have no impact on existing code if the feature is not used. On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson wrote: > I don't have a comprehensive knowledge of the project hydrogen PRs, > however I've perused them, and they make

Re: Review notification bot

2018-07-30 Thread Reynold Xin
I like the idea of this bot, but I'm somewhat annoyed by it. I have touched a lot of files and wrote a lot of the original code. Everyday I wake up I get a lot of emails from this bot. Also if we are going to use this, can we rename the bot to something like spark-bot, rather than holden's

Re: Why percentile and distinct are not done in one job?

2018-07-30 Thread Reynold Xin
Which API are you talking about? On Mon, Jul 30, 2018 at 7:03 AM 吴晓菊 wrote: > I noticed that in column analyzing, 2 jobs will run separately to > calculate percentiles and then distinct. Why not combine into one job since > HyperLogLog also supports merge? > > Chrysan Wu > Phone:+86 17717640807

Re: [Spark SQL] Future of CalendarInterval

2018-07-27 Thread Reynold Xin
CalendarInterval is definitely externally visible. E.g. sql("select interval 1 day").dtypes would return "Array[(String, String)] = Array((interval 1 days,CalendarIntervalType))" However, I'm not sure what it means to support casting. What are the semantics for casting from any other data type

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-07-26 Thread Reynold Xin
Seems reasonable at high level. I don't think we can use Expression's and SortOrder's in public APIs though. Those are not meant to be public and can break easily across versions. On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue wrote: > The recently adopted SPIP to standardize logical plans requires

Re: [DISCUSS][SQL] Control the number of output files

2018-07-26 Thread Reynold Xin
barrier >>>> scheduling at the Stage level -- so it is not completely obvious how to >>>> unify all of these policy options/preferences/mechanism, or whether it is >>>> possible, but I think it is worth considering such things at a fairly high >>>> level of abs

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Reynold Xin
Seems like a good idea in general. Do other systems have similar concepts? In general it'd be easier if we can follow existing convention if there is any. On Wed, Jul 25, 2018 at 11:50 AM John Zhuge wrote: > Hi all, > > Many Spark users in my company are asking for a way to control the number

Re: [SPARK-24865] Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin
bypassTransformAnalyzerCheck method. On Thu, Jul 19, 2018 at 2:52 PM Reynold Xin wrote: > We have had multiple bugs introduced by AnalysisBarrier. In hindsight I > think the original design before analysis barrier was much simpler and > requires less developer knowledge of the infra

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Reynold Xin
Looking at the list of pull requests it looks like this is the ticket: https://issues.apache.org/jira/browse/SPARK-24867 On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin wrote: > I don't think my ticket should block this release. It's a big general > refactoring. > > Xiao do you h

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Reynold Xin
I don't think my ticket should block this release. It's a big general refactoring. Xiao do you have a ticket for the bug you found? On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao wrote: > Hi Xiao, > > Are you referring to this JIRA ( > https://issues.apache.org/jira/browse/SPARK-24865)? > > Xiao

[SPARK-24865] Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin
We have had multiple bugs introduced by AnalysisBarrier. In hindsight I think the original design before analysis barrier was much simpler and requires less developer knowledge of the infrastructure. As long as analysis barrier is there, developers writing various code in analyzer will have to be

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-18 Thread Reynold Xin
+1 on this, on the condition that we can come up with a design that will remove the existing plans. On Tue, Jul 17, 2018 at 11:00 AM Ryan Blue wrote: > Hi everyone, > > From discussion on the proposal doc and the discussion thread, I think we > have consensus around the plan to standardize

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-15 Thread Reynold Xin
Makes sense. Thanks for looking into this. On Sun, Jul 15, 2018 at 1:51 PM Sean Owen wrote: > Yesterday I cleaned out old Spark releases from the mirror system -- we're > supposed to only keep the latest release from active branches out on > mirrors. (All releases are available from the Apache

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Reynold Xin
Yes I would just reuse the same function. On Sun, Jul 8, 2018 at 5:01 AM Li Jin wrote: > Hi Linar, > > This seems useful. But perhaps reusing the same function name is better? > > > http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame > >

Re: [DESIGN] Barrier Execution Mode

2018-07-08 Thread Reynold Xin
Xingbo, Please reference the spip and jira ticket next time: [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark On Sun, Jul 8, 2018 at 9:45 AM Xingbo Jiang wrote: > Hi All, > > I would like to invite you to review the design document for Barrier > Execution Mode: > >

code freeze and branch cut for Apache Spark 2.4

2018-07-06 Thread Reynold Xin
FYI 6 mo is coming up soon since the last release. We will cut the branch and code freeze on Aug 1st in order to get 2.4 out on time.

Re: Beam's recent community development work

2018-07-02 Thread Reynold Xin
That's fair, and it's great to find high quality contributors. But I also feel the two projects have very different background and maturity phase. There are 1300+ contributors to Spark, and only 300 to Beam, with the vast majority of contributions coming from a single company for Beam (based on my

Re: Feature request: Java-specific transform method in Dataset

2018-07-01 Thread Reynold Xin
This wouldn’t be a problem with Scala 2.12 right? On Sun, Jul 1, 2018 at 12:23 PM Sean Owen wrote: > I see, transform() doesn't have the same overload that other methods do in > order to support Java 8 lambdas as you'd expect. One option is to introduce > something like MapFunction for

Re: LICENSE and NOTICE file content

2018-06-21 Thread Reynold Xin
Thanks Justin. Can you submit a pull request? On Thu, Jun 21, 2018 at 8:10 PM Justin Mclean wrote: > Hi, > > We’ve recently had a number of incubating projects copy your LICENSE and > NOTICE files as they see Spark as a popular project and they are a little > sad when the IPMC votes -1 on their

Re: What about additional support on deeply nested data?

2018-06-20 Thread Reynold Xin
Seems like you are also looking for transform and reduce for arrays? https://issues.apache.org/jira/browse/SPARK-23908 https://issues.apache.org/jira/browse/SPARK-23911 On Wed, Jun 20, 2018 at 10:43 AM bobotu wrote: > I store some trajectories data in parquet with this schema: > > create
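The higher-order array functions those tickets track (transform and aggregate/reduce over array columns) can be sketched in plain Python — a hedged analogue of the proposed SQL semantics, not Spark's implementation; the function names simply mirror the proposal:

```python
from functools import reduce

def transform(arr, f):
    # transform: apply a function to every element of an array value
    return [f(x) for x in arr]

def aggregate(arr, zero, merge):
    # aggregate/reduce: fold an array down to a single value
    return reduce(merge, arr, zero)

trajectory = [1.0, 2.5, 4.0]
print(transform(trajectory, lambda x: x * 2))            # [2.0, 5.0, 8.0]
print(aggregate(trajectory, 0.0, lambda acc, x: acc + x))  # 7.5
```

In Spark SQL these would operate element-wise on an array column without first exploding it into rows.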

Re: time for Apache Spark 3.0?

2018-06-15 Thread Reynold Xin
project. It > should be better to drift away the historical burden and focus in new area. > Spark has been widely used all over the world as a successful big data > framework. And it can be better than that. > >> > >> Andy > >> > >> > >> On Thu, Apr

Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Reynold Xin
+1 on the proposal. On Fri, Jun 1, 2018 at 8:17 PM Hossein wrote: > Hi Shivaram, > > We converged on a CRAN release process that seems identical to current > SparkR. > > --Hossein > > On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman < > shiva...@eecs.berkeley.edu> wrote: > >> Hossein --

Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Reynold Xin
The behavior change is not good... On Thu, Jun 14, 2018 at 9:05 AM Li Jin wrote: > Ah, looks like it's this change: > > https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5 > > It seems strange that by default Spark doesn't build

Re: Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Reynold Xin
But from the user's perspective, optimization is not run right? So it is still lazy. On Fri, Jun 8, 2018 at 12:35 PM Li Jin wrote: > Hi All, > > Sorry for the long email title. I am a bit surprised to find that the > current optimizer rule "ConvertToLocalRelation" causes expressions to be >

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Reynold Xin
+1 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting an extended voting period. The

Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Reynold Xin
Yes everybody please cc the release manager on changes that merit -1. It's high overhead and let's make this smoother. On Fri, Jun 1, 2018 at 1:28 PM Marcelo Vanzin wrote: > Xiao, > > This is the third time in this release cycle that this is happening. > Sorry to single out you guys, but can

Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?

2018-05-30 Thread Reynold Xin
SQL expressions? On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski wrote: > Hi, > > I've been exploring RuntimeReplaceable expressions [1] and have been > wondering what their purpose is. > > Quoting the scaladoc [2]: > > > An expression that gets replaced at runtime (currently by the optimizer)

Re: Running lint-java during PR builds?

2018-05-21 Thread Reynold Xin
Can we look into if there is a plugin for sbt that works and then we can put everything into one single builder? On Mon, May 21, 2018 at 11:17 AM Dongjoon Hyun wrote: > Thank you for reconsidering this, Hyukjin. :) > > Bests, > Dongjoon. > > > On Mon, May 21, 2018 at

parser error?

2018-05-13 Thread Reynold Xin
Just saw this in one of my PR that's doc only: [error] warning(154): SqlBase.g4:400:0: rule fromClause contains an optional block with at least one alternative that can match an empty string

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
with several > debug actions. And this would benefit new and experienced users alike. > > Nick > > On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid wrote: > > I've opened SPARK-24215 to track this. >> >>

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
SQL engine must be > UnsafeRow? > > Personally, I think it makes sense to say that everything should accept > InternalRow, but produce UnsafeRow, with the understanding that UnsafeRow > will usually perform better. > > rb > ​ > > On Tue, May 8, 2018 at 4:09 PM, Reyn

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
What the internal operators do are strictly internal. To take one step back, is the goal to design an API so the consumers of the API can directly produces what Spark expects internally, to cut down perf cost? On Tue, May 8, 2018 at 1:22 PM Ryan Blue wrote: > While

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
nd do the > same thing. > > rb > ​ > > On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote: > >> s/underestimated/overestimated/ >> >> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote: >> >>> Marco,

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
s/underestimated/overestimated/ On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote: > Marco, > > There is understanding how Spark works, and there is finding bugs early in > their own program. One can perfectly understand how Spark works and still > fi

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
>>> explaining there is an error in the .write operation and they are debugging >>> the writing to disk, focusing on that piece of code :) >>> >>> unrelated, but another frequent cause for confusion is cascading errors. >>> like the FetchFailedException. they will be debug

eager execution and debuggability

2018-05-08 Thread Reynold Xin
Similar to the thread yesterday about improving ML/DL integration, I'm sending another email on what I've learned recently from Spark users. I recently talked to some educators that have been teaching Spark in their (top-tier) university classes. They are some of the most important users for

Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
; i.e. > > if some frameworks like xgboost/mxnet requires 50 parallel workers, Spark > is desired to provide a capability to ensure that either we run 50 tasks at > once, or we should quit the complete application/job after some timeout > period > > Best, > > Nan > &

Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
I think that's what Xiangrui was referring to. Instead of retrying a single task, retry the entire stage, and the entire stage of tasks need to be scheduled all at once. On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > >> >>>- Fault tolerance and

Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Reynold Xin
Would be great to document. Probably best with examples. On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas wrote: > The documentation for DataFrame.join() > > lists all the
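For illustration, the semantics behind two of the commonly confused join types (inner vs. left outer) can be sketched with plain Python dicts — a minimal stand-in for the row-matching behavior, not DataFrame.join() itself:

```python
left = {1: "a", 2: "b"}
right = {2: "x", 3: "y"}

# inner join: keep only keys present on both sides
inner = {k: (left[k], right[k]) for k in left.keys() & right.keys()}

# left outer join: keep every left key, padding missing right values with None
left_outer = {k: (left[k], right.get(k)) for k in left}

print(inner)       # {2: ('b', 'x')}
print(left_outer)  # {1: ('a', None), 2: ('b', 'x')}
```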

Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
zing the offline discussion! I added a few > comments inline. -Xiangrui > > On Mon, May 7, 2018 at 5:37 PM Reynold Xin <r...@databricks.com> wrote: > >> Hi all, >> >> Xiangrui and I were discussing with a heavy Apache Spark user last week >> on their experience

Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
Hi all, Xiangrui and I were discussing with a heavy Apache Spark user last week on their experiences integrating machine learning (and deep learning) frameworks with Spark and some of their pain points. Couple things were obvious and I wanted to share our learnings with the list. (1) Most

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-03 Thread Reynold Xin
Why do you need the underlying RDDs? Can't you just unpersist the dataframes that you don't need? On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas wrote: > This seems to be an underexposed part of the API. My use case is this: I > want to unpersist all DataFrames

Re: Process for backports?

2018-04-24 Thread Reynold Xin
1. We don't backport features. 2. In general we don't bump dependencies, unless they are for critical bug fixes. 3. We weight the risk of new regression vs bug fixes. To state the obvious, we wouldn't backport a bug fix if it only affects a very small number of use cases but require very complex

Re: Correlated subqueries in the DataFrame API

2018-04-19 Thread Reynold Xin
Perhaps we can just have a function that turns a DataFrame into a Column? That'd work for both correlated and uncorrelated case, although in the correlated case we'd need to turn off eager analysis (otherwise there is no way to construct a valid DataFrame). On Thu, Apr 19, 2018 at 4:08 PM, Ryan

Scala 2.12 support

2018-04-19 Thread Reynold Xin
Forking the thread to focus on Scala 2.12. Dean, There are couple different issues with Scala 2.12 (closure cleaner, API breaking changes). Which one do you think we can address with a Scala upgrade? (The closure cleaner one I haven't spent a lot of time looking at it but it might involve more

Re: Sorting on a streaming dataframe

2018-04-13 Thread Reynold Xin
Can you describe your use case more? On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat wrote: > Hi Guys, > > Why is sorting on streaming dataframes not supported(unless it is complete > mode)? My downstream needs me to sort the streaming dataframe. > > Hemant >

Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Reynold Xin
Seems like this would make sense... we usually make maintenance releases for bug fixes after a month anyway. On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson wrote: > > > On 11 April 2018 at 12:47, Ryan Blue wrote: > >> I think a 1.8.3 Parquet

time for Apache Spark 3.0?

2018-04-04 Thread Reynold Xin
There was a discussion thread on scala-contributors about Apache Spark not yet supporting Scala 2.12, and that got me to think perhaps it is about time for Spark to work towards the 3.0 release. By the

Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
github.com/apache/spark/pull/5604#discussion_r157931911 > :) > > 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>: > >> Do other (non-Hive) SQL systems do the same thing? >> >> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier < >> her

Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
UNBOUNDED PRECEDING AND UNBOUNDED >> FOLLOWING. > > > It sort of makes sense if you think about it. If there is no ordering > there is no way to have a bound frame. If there is ordering we default to > the most commonly used deterministic frame. > > > On Tue, Apr 3, 2018 at
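The default-frame behavior discussed above can be illustrated without Spark: with an ORDER BY, the default frame runs from the start of the partition to the current row (a running aggregate); without one, the frame covers the whole partition. A plain-Python sketch, assuming a single partition of integer values:

```python
vals = [2, 1, 3]

# With ORDER BY: the default frame is RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW, i.e. a running aggregate over the sorted rows.
ordered = sorted(vals)
running = [sum(ordered[: i + 1]) for i in range(len(ordered))]

# Without ORDER BY: the frame is the entire partition, so every row
# sees the same total.
total = [sum(vals)] * len(vals)

print(running)  # [1, 3, 6]
print(total)    # [6, 6, 6]
```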

Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Seems like a bug. On Tue, Apr 3, 2018 at 1:26 PM, Li Jin wrote: > Hi Devs, > > I am seeing some behavior with window functions that is a bit unintuitive > and would like to get some clarification. > > When using aggregation function with window, the frame boundary seems

Re: [build system] experiencing network issues, git fetch timeouts likely

2018-04-02 Thread Reynold Xin
Thanks Shane for taking care of this! On Mon, Apr 2, 2018 at 9:12 PM shane knapp wrote: > the problem was identified and fixed, and we should be good as of about an > hour ago. > > sorry for any inconvenience! > > On Mon, Apr 2, 2018 at 4:15 PM, shane knapp

Re: Hadoop 3 support

2018-04-02 Thread Reynold Xin
blocking issue is really > SPARK-18673. > > > On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin <r...@databricks.com> wrote: > > Does anybody know what needs to be done in order for Spark to support > Hadoop > > 3? > > > > > > -- > Marcelo >

Re: Hadoop 3 support

2018-04-02 Thread Reynold Xin
:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote: > Specifically to run spark with hadoop 3 docker support, I have filed a > few jira's tracked under [1]. > > Regards, > Mridul > > [1] https://issues.apache.org/jira/browse/SPARK-23717 > > > On Mon, Ap

Hadoop 3 support

2018-04-02 Thread Reynold Xin
Does anybody know what needs to be done in order for Spark to support Hadoop 3?

Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-28 Thread Reynold Xin
If you need the functionality I would recommend you just copying the code over to your project and use it that way. On Wed, Mar 28, 2018 at 9:02 AM Felix Cheung wrote: > I think the difference is py4j is a public library whereas the R backend > is specific to SparkR.

Re: Reserved Words in Spark SQL as TableAliases

2018-03-19 Thread Reynold Xin
I agree but the issue was backward compatibility... On Mon, Mar 19, 2018 at 4:02 PM Russell Spitzer wrote: > I found > https://issues.apache.org/jira/browse/SPARK-20964 > > but currently it seems like strictIdentifiers are allowed to contain any > reserved key words >

Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-06 Thread Reynold Xin
raised about FIFO. We just need to do the planning > outside > > of the scheduler loop. The call site thread sounds like a reasonable > place > > to me. > > > > On Mon, Mar 5, 2018 at 12:56 PM, Reynold Xin <r...@databricks.com> > wrote: > >> > &

Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-05 Thread Reynold Xin
Rather than using a separate thread pool, perhaps we can just move the prep code to the call site thread? On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty wrote: > DAGScheduler becomes a bottleneck in cluster when multiple JobSubmitted > events has to be processed as

Re: Welcoming some new committers

2018-03-02 Thread Reynold Xin
Congrats and welcome! On Fri, Mar 2, 2018 at 10:41 PM, Matei Zaharia wrote: > Hi everyone, > > The Spark PMC has recently voted to add several new committers to the > project, based on their contributions to Spark 2.3 and other past work: > > - Anirudh Ramanathan

Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Reynold Xin
This was actually an AMPLab bucket. On Feb 27, 2018, 6:04 PM +1300, Holden Karau , wrote: > Thanks Nick, we deprecated this during the roll over to the new release > managers. I assume this bucket was maintained by someone at databricks so > maybe they can chime in. > >

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Reynold Xin
+1 On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal , wrote: > > > this file shouldn't be included? > > > https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml > > > > I've now deleted this file > > > > > From: Sameer Agarwal

Re: Drop the Hadoop 2.6 profile?

2018-02-08 Thread Reynold Xin
Does it gain us anything to drop 2.6? > On Feb 8, 2018, at 10:50 AM, Sean Owen wrote: > > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both fairly old, > and actually, not different from 2.7 with respect to Spark. That is, I don't > know if we are actually

Re: data source v2 online meetup

2018-02-01 Thread Reynold Xin
ng with this design. I'd love to see a >> design document that covers why this is a necessary choice (but again, >> separately). >> >> rb >> >> On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung <felixcheun...@hotmail.com> >> wrote: >> >>>

Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Reynold Xin
You can just do that with mapPartitions pretty easily can’t you? On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng wrote: > HI all: > > > >1, Dataset API supports operation “sortWithinPartitions”, but in > RDD API there is no counterpart (I know there is >
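Reynold's point — that a per-partition sort needs nothing beyond mapPartitions — can be sketched with plain Python lists standing in for RDD partitions (a hedged analogue; in Spark the call would be along the lines of rdd.mapPartitions(lambda it: iter(sorted(it)))):

```python
def sort_within_partitions(partitions, key=None):
    # Each partition is sorted independently; no shuffle or global
    # ordering across partitions is involved.
    return [sorted(p, key=key) for p in partitions]

parts = [[3, 1, 2], [9, 7]]
print(sort_within_partitions(parts))  # [[1, 2, 3], [7, 9]]
```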

data source v2 online meetup

2018-01-31 Thread Reynold Xin
Data source v2 API is one of the larger main changes in Spark 2.3, and whatever that has already been committed is only the first version and we'd need more work post-2.3 to improve and stabilize it. I think at this point we should stop making changes to it in branch-2.3, and instead focus on

Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Reynold Xin
For the DataFrame/Dataset API, the optimizer rewrites orderBy followed by a take into a priority queue based top implementation actually. On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari wrote: > Hi All, > > Would it make sense to add a "top()" method to the Dataset API? >
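The rewrite Reynold describes (orderBy followed by take becoming a bounded, priority-queue-based top-K) rests on a standard algorithm that can be shown with Python's heapq — an illustration of the technique only, not Spark's actual physical operator:

```python
import heapq

def top_k(records, k, key=None):
    # Maintain only the k largest items seen so far, so memory stays
    # O(k) instead of sorting the entire input.
    return heapq.nlargest(k, records, key=key)

rows = [("a", 3), ("b", 9), ("c", 1), ("d", 7)]
print(top_k(rows, 2, key=lambda r: r[1]))  # [('b', 9), ('d', 7)]
```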

Re: ***UNCHECKED*** [jira] [Resolved] (SPARK-23218) simplify ColumnVector.getArray

2018-01-26 Thread Reynold Xin
I have no idea. Some JIRA update? Might want to file an INFRA ticket. On Fri, Jan 26, 2018 at 10:04 AM, Sean Owen wrote: > This is an example of the "*** UNCHECKED ***" message I was talking about > -- it's part of the email subject rather than JIRA. > > --

Re: What is "*** UNCHECKED ***"?

2018-01-26 Thread Reynold Xin
Examples? On Fri, Jan 26, 2018 at 9:56 AM, Sean Owen wrote: > I probably missed this, but what is the new "*** UNCHECKED ***" message in > the subject line of some JIRAs? >

Re: Spark 3

2018-01-19 Thread Reynold Xin
We can certainly provide a build for Scala 2.12, even in 2.x. On Fri, Jan 19, 2018 at 10:17 AM, Justin Miller < justin.mil...@protectwise.com> wrote: > Would that mean supporting both 2.12 and 2.11? Could be a while before > some of our libraries are off of 2.11. > > Thanks, > Justin > > > On

Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Reynold Xin
I don’t think Spark relies on Kryo or Java for persistence. User programs might though so it would be great if we can shade it. On Fri, Jan 19, 2018 at 5:55 AM Sean Owen wrote: > See: > > https://issues.apache.org/jira/browse/SPARK-23131 >

Re: Integration testing and Scheduler Backends

2018-01-09 Thread Reynold Xin
If we can actually get our acts together and have integration tests in Jenkins (perhaps not run on every commit but can be run weekly or pre-release smoke tests), that'd be great. Then it relies less on contributors manually testing. On Tue, Jan 9, 2018 at 8:09 AM, Timothy Chen

Re: [SPIP] as-of join in Spark SQL

2018-01-03 Thread Reynold Xin
I've replied on the ticket online ... On Wed, Jan 3, 2018 at 11:41 AM, Li Jin wrote: > Hi community, > > Following instruction on https://spark.apache.org/ > improvement-proposals.html, I'd like to propose a SPIP: as-of join in > Spark SQL. > > Here is the Jira: >

Re: A list of major features in 2.3

2018-01-03 Thread Reynold Xin
It hasn't been compiled yet, but you can look up all the features on JIRA by setting a filter on fixed versions. Usually the release manager compiles the list when it is towards the end of the release cycle (coming up soon). On Mon, Dec 25, 2017 at 10:07 PM, Anoop Saxena

Re: Result obtained before the completion of Stages

2017-12-27 Thread Reynold Xin
Is it possible there is a bug for the UI? If you can run jstack on the executor process to see whether anything is actually running, that can help narrow down the issue. On Tue, Dec 26, 2017 at 10:28 PM ckhari4u wrote: > Hi Reynold, > > I am running a Spark SQL query. > >

Re: Result obtained before the completion of Stages

2017-12-26 Thread Reynold Xin
What did you run? On Tue, Dec 26, 2017 at 10:21 PM, ckhari4u wrote: > Hi Sean, > > Thanks for the reply. I believe I am not facing the scenarios you > mentioned. > > Timestamp conflict: I see the Spark driver logs on the console (tried with > INFO and DEBUG). In all the

Re: [01/51] [partial] spark-website git commit: 2.2.1 generated doc

2017-12-17 Thread Reynold Xin
There is an additional step that's needed to update the symlink, and that step hasn't been done yet. On Sun, Dec 17, 2017 at 12:32 PM, Jacek Laskowski wrote: > Hi Sean, > > What does "Not all the pieces are released yet" mean if you don't mind me > asking? 2.2.1 has already

Re: Decimals

2017-12-13 Thread Reynold Xin
Responses inline On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido wrote: > Hi all, > > I saw in these weeks that there are a lot of problems related to decimal > values (SPARK-22036, SPARK-22755, for instance). Some are related to > historical choices, which I don't know,

Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-01 Thread Reynold Xin
Congrats. On Fri, Dec 1, 2017 at 12:10 AM, Felix Cheung wrote: > This vote passes. Thanks everyone for testing this release. > > > +1: > > Sean Owen (binding) > > Herman van Hövell tot Westerflier (binding) > > Wenchen Fan (binding) > > Shivaram Venkataraman (binding) >

Re: OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Reynold Xin
Is this due to the insert command not having metrics? It's a problem we should fix. On Mon, Nov 27, 2017 at 10:45 AM, Jason White wrote: > I'd like to use the SparkListenerInterface to listen for some metrics for > monitoring/logging/metadata purposes. The first ones

Re: Faster and Lower memory implementation toPandas

2017-11-16 Thread Reynold Xin
Please send a PR. Thanks for looking at this. On Thu, Nov 16, 2017 at 7:27 AM Andrew Andrade wrote: > Hello devs, > > I know a lot of great work has been done recently with pandas to spark > dataframes and vice versa using Apache Arrow, but I faced a specific pain >

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Reynold Xin
Most of those thoughts from Wenchen make sense to me. Rather than a list, can we create a table? X-axis is data type, and Y-axis is also data type, and the intersection explains what the coerced type is? Can we also look at what Hive, standard SQL (Postgres?) do? Also, this shouldn't be

Re: how to replace hdfs with a custom distributed fs ?

2017-11-11 Thread Reynold Xin
You can implement the Hadoop FileSystem API for your distributed java fs and just plug into Spark using the Hadoop API. On Sat, Nov 11, 2017 at 9:37 AM, Cristian Lorenzetto < cristian.lorenze...@gmail.com> wrote: > hi i have my distributed java fs and i would like to implement my class > for

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Reynold Xin
The vote has passed with the following +1s: Reynold Xin* Debasish Das Noman Khan Wenchen Fan* Matei Zaharia* Weichen Xu Vaquar Khan Burak Yavuz Xiao Li Tom Graves* Michael Armbrust* Joseph Bradley* Shixiong Zhu* And the following +0s: Sean Owen* Thanks for the feedback! On Wed, Nov 1, 2017

Re: Jenkins upgrade/Test Parallelization & Containerization

2017-11-07 Thread Reynold Xin
My understanding is that AMP actually can provide more resources or adapt changes, while ASF needs to manage 200+ projects and it's hard to accommodate much. I could be wrong though. On Tue, Nov 7, 2017 at 2:14 PM, Holden Karau wrote: > True, I think we've seen that the

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Reynold Xin
esday, November 1, 2017, 11:29:21 AM CDT, Debasish Das < > debasish.da...@gmail.com> wrote: > > > +1 > > Is there any design doc related to API/internal changes ? Will CP be the > default in structured streaming or it's a mode in conjunction with > exisiting behavior. &g

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Reynold Xin
Why tie a maintenance release to a feature release? They are supposed to be independent and we should be able to make a lot of maintenance releases as needed. On Thu, Nov 2, 2017 at 7:13 PM Sean Owen wrote: > The feature freeze is "mid November" : >

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Reynold Xin
I just replied. On Wed, Nov 1, 2017 at 5:50 PM, Cody Koeninger <c...@koeninger.org> wrote: > Was there any answer to my question around the effect of changes to > the sink api regarding access to underlying offsets? > > On Wed, Nov 1, 2017 at 11:32 AM, Reynold Xin <r...@

[Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Reynold Xin
Earlier I sent out a discussion thread for CP in Structured Streaming: https://issues.apache.org/jira/browse/SPARK-20928 It is meant to be a very small, surgical change to Structured Streaming to enable ultra-low latency. This is great timing because we are also designing and implementing data
