Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit, but a JVM string cannot be bigger than 2 GB. It will also at some point run out of memory with too big a query plan tree, or become incredibly slow due to query planning complexity. I've seen queries that are tens of MBs in size. On Thu, Jul 11, 2019 at 5:01 AM, 李书明 <

Revisiting Python / pandas UDF

2019-07-05 Thread Reynold Xin
Hi all, In the past two years, pandas UDFs have been perhaps the most important change to Spark for Python data science. However, these functionalities have evolved organically, leading to some inconsistencies and confusion among users. I created a ticket and a document summarizing the issues,

Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-01 Thread Reynold Xin
That's a good idea. We should only be using squash. On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > Hi, Apache Spark PMC members and committers. > > > We are using GitHub `Merge Button` in `spark-website` repository > because it's very convenient. > > >

Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first? On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote: > Hi, All. > > Since we use both Apache JIRA and GitHub actively for Apache Spark > contributions, we have lots of JIRAs and PRs consequently. One specific > thing I've been long

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan. On Thu, May 30, 2019 at 7:55 AM shane knapp wrote: > I don't have a good sense of the overhead of continuing to support >> Python 2; is it large enough to consider dropping it in Spark 3.0? >> >> from the build/test side, it will actually be pretty easy to continue > suppo

Re: [RESULT][VOTE] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Reynold Xin
Thanks Tom. I finally had time to look at the updated SPIP 10 mins ago. I support the high level idea and +1 on the SPIP. That said, I think the proposed API is too complicated and an invasive change to the existing internals. A much simpler API would be to expose a columnar batch iterator interf

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-25 Thread Reynold Xin
Can we push this to June 1st? I have been meaning to read it but unfortunately keep traveling... On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun wrote: > +1 > > Thanks, > Dongjoon. > > On Fri, May 24, 2019 at 17:03 DB Tsai wrote: > >> +1 on exposing the APIs for columnar processing support. >> >

Re: Interesting implications of supporting Scala 2.13

2019-05-11 Thread Reynold Xin
> Interested in thoughts on how to proceed on something like this, as there > will probably be a few more similar issues. > > > > On Fri, May 10, 2019 at 3:32 PM Reynold Xin < rxin@ databricks. com ( > r...@databricks.com ) > wrote: > > >> >> >>

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
d. I failed and gave up. >> >> >> At some point maybe we figure out whether we can remove the SBT-based >> build if it's super painful, but only if there's not much other choice. >> That is for a future thread. >> >> >> >> On

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Looks like a great idea to make changes in Spark 3.0 to prepare for the Scala 2.13 upgrade. Are there breaking changes that would require us to maintain two different source trees for 2.12 vs 2.13? On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote: > > > > While that's not happe

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-26 Thread Reynold Xin
I do feel it'd be better to not switch default Scala versions in a minor release. I don't know how much downstream this impacts. Dotnet is a good data point. Anybody else hit this issue? On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote: > > > > Very much interested in

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Reynold Xin
"if others think it would be helpful, we can cancel this vote, update the SPIP to clarify exactly what I am proposing, and then restart the vote after we have gotten more agreement on what APIs should be exposed" That'd be very useful. At least I was confused by what the SPIP was about. No poin

Re: Spark 2.4.2

2019-04-17 Thread Reynold Xin
normally wouldn't backport, except that I've heard a > few times about concerns about CVEs affecting Databind, so wondering > who else out there might have an opinion. I'm not pushing for it > necessarily. > > On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin wrote: > > >

Re: Spark 2.4.2

2019-04-17 Thread Reynold Xin
For Jackson - are you worrying about JSON parsing for users or internal Spark functionality breaking? On Wed, Apr 17, 2019 at 6:02 PM Sean Owen wrote: > There's only one other item on my radar, which is considering updating > Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up

Re: pyspark.sql.functions ide friendly

2019-04-17 Thread Reynold Xin
Are you talking about the ones that are defined in a dictionary? If yes, that was actually not that great in hindsight (makes it harder to read & change), so I'm OK changing it. E.g. _functions = { 'lit': _lit_doc, 'col': 'Returns a :class:`Column` based on the given column name.',
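
For readers unfamiliar with the pattern under discussion, here is a simplified, hypothetical reconstruction (plain Python, not the actual PySpark source) of how callables can be stamped out from such a dictionary. It shows the indirection that makes the module harder to read and change: the names exist only at runtime, so IDEs cannot see them.

```python
# Hypothetical sketch of the dictionary-driven pattern: names and
# docstrings live in a dict, and the functions are generated in a loop.
_functions = {
    'lit': 'Creates a Column of literal value.',
    'col': 'Returns a Column based on the given column name.',
}

def _create_function(name, doc):
    def _(col):
        # Real PySpark would call into the JVM here; we just echo the call.
        return f"{name}({col})"
    _.__name__ = name
    _.__doc__ = doc
    return _

# Stamp the functions into the module namespace at import time,
# which is exactly what static analysis tools cannot follow.
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

print(col("age"))  # -> col(age)
```

Rewriting each entry as an explicit `def` (the direction the thread leans toward) trades a little repetition for names that IDEs and type checkers can resolve.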

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Reynold Xin
gt; Do you have design doc? I'm also interested in this topic and want to help >>>> contribute. >>>> >>>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans < bobby@ apache. org ( >>>> bo...@apache.org ) > wrote: >>>> >>>>

Re: [DISCUSS] Spark Columnar Processing

2019-04-01 Thread Reynold Xin
I just realized I didn't make it very clear my stance here ... here's another try: I think it's a no brainer to have a good columnar UDF interface. This would facilitate a lot of high performance applications, e.g. GPU-based accelerations for machine learning algorithms. On rewriting the entir

Do you use single-quote syntax for the DataFrame API?

2019-03-30 Thread Reynold Xin
As part of evolving the Scala language, the Scala team is considering removing single-quote syntax for representing symbols. Single-quote syntax is one of the ways to represent a column in Spark's DataFrame API. While I personally don't use them (I prefer just using strings for column names, or

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Reynold Xin
We tried enabling blacklisting for some customers and in the cloud, very quickly they end up having 0 executors due to various transient errors. So unfortunately I think the current implementation is terrible for cloud deployments, and shouldn't be on by default. The heart of the issue is that t

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
of DataType classes. >> >> >> All of these options are likely to have implications for the catalyst >> systems. I'm not sure if they are minor or more substantial. >> >> >> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com ( >

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
All of these options are likely to have implications for the catalyst > systems. I'm not sure if they are minor or more substantial. > > > On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com ( > r...@databricks.com ) > wrote: > > >> Yes this is known a

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Yes this is known and an issue for performance. Do you have any thoughts on how to fix this? On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote: > I describe some of the details here: > https://issues.apache.org/jira/browse/SPARK-27296 > > The short version of the story is that aggregating dat
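
The inefficiency SPARK-27296 describes can be modeled without Spark at all. The sketch below is a toy in plain Python (not Spark code): the "UDAF" variant round-trips the aggregation buffer through serialization on every update, while the "Aggregator" variant keeps a live buffer and would serialize only once per partition. Both compute the same result; the difference is purely overhead.

```python
import pickle

def sum_udaf_per_row(rows):
    # Toy model of the reported problem: the aggregation buffer is
    # deserialized and re-serialized around every single update call.
    buf = pickle.dumps(0)
    for r in rows:
        acc = pickle.loads(buf)   # deserialize buffer
        acc += r                  # apply the update
        buf = pickle.dumps(acc)   # re-serialize buffer
    return pickle.loads(buf)

def sum_aggregator(rows):
    # The efficient shape: keep the buffer as a live object and
    # serialize only when the partial result is shipped (not shown).
    acc = 0
    for r in rows:
        acc += r
    return acc

data = list(range(1000))
assert sum_udaf_per_row(data) == sum_aggregator(data) == sum(data)
```

With N input rows the first variant pays 2N serialization round trips; the second pays a constant number, which is the fix the thread is asking for.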

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Reynold Xin
26% improvement is underwhelming if it requires massive refactoring of the codebase. Also you can't just add the benefits up this way, because: - Both vectorization and codegen reduce the overhead in virtual function calls - Vectorization code is more friendly to compilers / CPUs, but requires

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
n Kwon < gurwls223@ gmail. com ( > gurwls...@gmail.com ) > wrote: > > >> BTW, I am working on the documentation related with this subject at https:/ >> / issues. apache. org/ jira/ browse/ SPARK-26022 ( >> https://issues.apache.org/jira/browse/SPARK-26022 ) to desc

Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
s working on it - I'd prefer > collaborating. > > Note - I'm not recommending we make the logical plan mutable (as I am > scared of that too!). I think there are other ways of handling that - but > we can go into details later. > > On Tue, Mar 26, 2019 at 11:58 AM R

Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have been thinking about some of these issues. Some of them are harder to do, e.g. Spark DataFrames are fundamentally immutable, and making the logical plan mutable is a significant deviation from the current paradigm that might confuse the hell out of some users. We are considering building a s

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Reynold Xin
At some point we should celebrate having the largest RC number ever in Spark ... On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai < dbt...@dbtsai.com.invalid > wrote: > > > > RC9 was just cut. Will send out another thread once the build is finished. > > > > > Sincerely, > > > > DB Tsai > ---

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Reynold Xin
+1 on doing this in 3.0. On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung < felixcheun...@hotmail.com > wrote: > > I’m +1 if 3.0 > > > >   > *From:* Sean Owen < srowen@ gmail. com ( sro...@gmail.com ) > > *Sent:* Monday, March 25, 2019 6:48 PM > *To:* Hyukjin Kwon > *Cc:* dev; Bryan Cutler; Tak

Re: understanding the plans of spark sql

2019-03-18 Thread Reynold Xin
This is more of a question for the connector. It depends on how the connector is implemented. Some implement aggregate pushdown, but most don't. On Mon, Mar 18, 2019 at 10:05 AM, asma zgolli < zgollia...@gmail.com > wrote: > > Hello, > > > I'm executing an SQL workload using Spark SQL on dat

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Reynold Xin
If you use UDFs in Python, you would want to use Pandas UDF for better performance. On Mon, Mar 11, 2019 at 7:50 PM Jonathan Winandy wrote: > Thanks, I didn't know! > > That being said, any udf use seems to affect badly code generation (and > the performance). > > > On Mon, 11 Mar 2019, 15:13 Dy
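
The row-at-a-time vs. vectorized distinction behind this advice can be illustrated without Spark. The sketch below is plain Python (not PySpark; the function names are illustrative): a scalar UDF crosses the Python call boundary once per value, while a pandas-style UDF crosses it once per batch, which is where most of the speedup comes from.

```python
calls = {"scalar": 0, "batch": 0}

def plus_one_scalar(x):
    calls["scalar"] += 1          # invoked once per row
    return x + 1

def plus_one_batch(values):
    calls["batch"] += 1           # invoked once per batch of rows
    return [v + 1 for v in values]

data = list(range(10_000))
assert [plus_one_scalar(x) for x in data] == plus_one_batch(data)

# Same result, but the row-at-a-time version paid the call overhead
# 10,000 times and the batched version paid it once.
assert calls == {"scalar": 10_000, "batch": 1}
```

In real PySpark the batched variant additionally avoids per-value serialization by shipping Arrow record batches, so the gap is larger than this toy suggests.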

Re: [SQL] hash: 64-bits and seeding

2019-03-06 Thread Reynold Xin
Rather than calling it hash64, it'd be better to just call it xxhash64. The reason being that ten years from now, we would probably look back and laugh at a specific hash implementation. It'd be better to just name the expression what it is. On Wed, Mar 06, 2019 at 7:59 PM, < huon.wil...@data61.csir

Re: Hive Hash in Spark

2019-03-06 Thread Reynold Xin
I think they might be used in bucketing? Not 100% sure. On Wed, Mar 06, 2019 at 1:40 PM, < tcon...@gmail.com > wrote: > > > > Hi, > > > >   > > > > I noticed the existence of a Hive Hash partitioning implementation in > Spark, but also noticed that it’s not being used, and that the Spark

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-03-06 Thread Reynold Xin
nyone is free to take on this, but I have no experience with R. > > >   > > > > If you folks agree with this, let us know, so we can move forward with the > merge. > > > >   > > > > Best. > > > >   > > > > -- André. > &

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Reynold Xin
reement that the intent >> is not to make the exact names binding, we should be okay. >> >> >> I can remove the user-facing API sketch, but I'd prefer to leave it in the >> sketch section so we have it documented somewhere. >> >> On Fri, Mar 1, 20

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Reynold Xin
Ryan - can you take the public user facing API part out of that SPIP? In general it'd be better to have the SPIPs be higher level, and put the detailed APIs in a separate doc. Alternatively, put them in the SPIP but explicitly vote on the high level stuff and not the detailed APIs.  I don't wan

Re: CombinePerKey and GroupByKey

2019-02-28 Thread Reynold Xin
This should be fine. Dataset.groupByKey is a logical operation, not a physical one (as in Spark wouldn’t always materialize all the groups in memory). On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot wrote: > Hi all, > > I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Reynold Xin
We will have to fix that before we declare DSv2 is stable, because InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote: > Will that then require an API break down the line? Do we save that for > Spark 4? > > > > -Matt Cheah

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Reynold Xin
The challenge with the Scala/Java API in the past is that when there are multiple parameters, it'd lead to an explosion of function overloads.  On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheun...@hotmail.com > wrote: > > I hear three topics in this thread > > > 1. I don’t think we s

Re: [DISCUSS] SPIP: Relational Cache

2019-02-24 Thread Reynold Xin
How is this different from materialized views? On Sun, Feb 24, 2019 at 3:44 PM Daoyuan Wang wrote: > Hi everyone, > > We'd like to discuss our proposal of Spark relational cache in this > thread. Spark has native command for RDD caching, but the use of CACHE > command in Spark SQL is limited, as

Re: merge script stopped working; Python 2/3 input() issue?

2019-02-15 Thread Reynold Xin
lol On Fri, Feb 15, 2019 at 4:02 PM, Marcelo Vanzin < van...@cloudera.com.invalid > wrote: > > > > You're talking about the spark-website script, right? The main repo's > script has been working for me, the website one is broken. > > > > I think it was caused by this dude changing raw_inpu

Re: [build system] speeding up maven build building only changed modules compared to master branch

2019-01-28 Thread Reynold Xin
This might be useful to do. BTW, based on my experience with different build systems in the past few years (extensively SBT/Maven/Bazel, and to a lesser extent Gradle/Cargo), I think the longer term solution is to move to Bazel. It is so much easier to understand and use, and also much more featu

Re: Make .unpersist() non-blocking by default?

2019-01-28 Thread Reynold Xin
Seems to make sense to have it false by default. (I agree this deserves a dev list mention though even if there is easy consensus). We should make sure we mark the JIRA with the "releasenotes" label so we can add it to the upgrade guide. On Mon, Jan 28, 2019 at 8:47 AM Sean Owen wrote: > Interesting notion at

Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Reynold Xin
If we can make the annotations compatible with Python 2, why don’t we add type annotations to make life easier for users of Python 3 (with typing)? On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz wrote: > > Hello everyone, > > I'd like to revisit the topic of adding PySpark type annotations in 3.

Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-23 Thread Reynold Xin
ns of Hive metastore. Feel >> free to ping me if we hit any issue about it. >> >> Cheers, >> >> Xiao >> >> Reynold Xin wrote on Tue, Jan 22, 2019 at 11:18 PM: >> >>> Actually a non trivial fraction of users / customers I interact with >>> still us

Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Reynold Xin
Actually a non trivial fraction of users / customers I interact with still use very old Hive metastores, because it’s very difficult to upgrade a Hive metastore wholesale (it’d require all the production jobs that access the same metastore to be upgraded at once). This is even harder than JVM upgrade
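
For context, Spark lets a job talk to an older metastore without upgrading it, via the Hive client configuration. A minimal spark-defaults.conf fragment might look like the following (the version value is illustrative):

```properties
# Hypothetical spark-defaults.conf fragment: pin the Hive metastore
# client version instead of upgrading the shared metastore itself.
spark.sql.hive.metastore.version  1.2.1
spark.sql.hive.metastore.jars     maven
```

This is the mechanism that makes dropping client support for very old metastore versions (the subject of this thread) a user-visible change.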

Re: Make proactive check for closure serializability optional?

2019-01-22 Thread Reynold Xin
com ( sro...@gmail.com ) > > *Sent:* Monday, January 21, 2019 10:42 AM > *To:* Reynold Xin > *Cc:* dev > *Subject:* Re: Make proactive check for closure serializability optional? >   > None except the bug / PR I linked to, which is really just a bug in > the RowMatrix implementati

Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Reynold Xin
Did you actually observe a perf issue? On Mon, Jan 21, 2019 at 10:04 AM Sean Owen wrote: > The ClosureCleaner proactively checks that closures passed to > transformations like RDD.map() are serializable, before they're > executed. It does this by just serializing it with the JavaSerializer. > >
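
The check under discussion is JVM-side (the ClosureCleaner serializes closures with the JavaSerializer), but the idea is easy to model in plain Python with pickle. The sketch below is an analogy, not Spark code; the helper name is illustrative:

```python
import pickle
import socket

def check_serializable(obj):
    # Analogue of the proactive check: eagerly serialize the closure
    # so an unserializable capture fails fast at the driver, instead
    # of failing later when the task is shipped to an executor.
    pickle.dumps(obj)
    return obj

check_serializable(len)  # a picklable callable passes the check

sock = socket.socket()
failed = False
try:
    # Neither a lambda nor its captured live socket survives pickling,
    # so the check rejects this closure up front.
    check_serializable(lambda data: sock.send(data))
except Exception:
    failed = True
sock.close()
assert failed, "expected the proactive check to reject the closure"
```

The cost question in the email is whether this extra eager serialization pass is ever measurable relative to the real serialization that happens at task dispatch anyway.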

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Reynold Xin
BTW the largest change to SS right now is probably the entire data source API v2 effort, which aims to unify streaming and batch from data source perspective, and provide a reliable, expressive source/sink API. On Mon, Jan 14, 2019 at 5:34 PM, Reynold Xin < r...@databricks.com >

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Reynold Xin
There are a few things to keep in mind: 1. Structured Streaming isn't an independent project. It actually (by design) depends on all the rest of Spark SQL, and virtually all improvements to Spark SQL benefit Structured Streaming. 2. The project as far as I can tell is relatively mature for core

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Reynold Xin
Thanks for writing this up. Just to show why option 1 is not sufficient. MySQL and Postgres are the two most popular open source database systems, and both support database → schema → table 3 part identification, so Spark supporting only 2 part name passing to the data source (option 1) isn't su

Re: [DISCUSS] Handling correctness/data loss jiras

2019-01-04 Thread Reynold Xin
Committers, When you merge tickets fixing correctness bugs, please make sure you tag the tickets with "correctness" label. I've found multiple tickets today that didn't do that. On Fri, Aug 17, 2018 at 7:11 AM, Tom Graves < tgraves...@yahoo.com.invalid > wrote: > > Since we haven't heard any

Re: Remove non-Tungsten mode in Spark 3?

2019-01-03 Thread Reynold Xin
The issue with the offheap mode is it is a pretty big behavior change and does require additional setup (also for users that run with UDFs that allocate a lot of heap memory, it might not be as good). I can see us removing the legacy mode since it's been legacy for a long time and perhaps very
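
The "additional setup" mentioned above refers to off-heap memory being opt-in and requiring an explicit size. A minimal spark-defaults.conf fragment might look like this (size value illustrative):

```properties
# Hypothetical spark-defaults.conf fragment: off-heap mode must be
# enabled explicitly and given a fixed budget outside the JVM heap.
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     2g
```

Because the off-heap budget sits outside `-Xmx`, container memory limits also have to account for it, which is part of the behavior change being weighed.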

Re: Trigger full GC during executor idle time?

2018-12-31 Thread Reynold Xin
Not sure how reputable or representative that paper is... On Mon, Dec 31, 2018 at 10:57 AM Sean Owen wrote: > https://github.com/apache/spark/pull/23401 > > Interesting PR; I thought it was not worthwhile until I saw a paper > claiming this can speed things up to the tune of 2-6%. Has anyone > c

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Reynold Xin
I'd only do any of the schema evolution things as add-on on top. This is an extremely complicated area and we could risk never shipping anything because there would be a lot of different requirements. On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer < russell.spit...@gmail.com > wrote: > > I

Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I added my comment there too! On Wed, Dec 19, 2018 at 7:26 PM, Hyukjin Kwon < gurwls...@gmail.com > wrote: > > Yea, that's a bit noisy .. I would just completely disable it to be > honest. I failed https:/ / issues. apache. org/ jira/ browse/ INFRA-17469 ( > https://issues.apache.org/jira/browse

Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I think there is an infra ticket open for it right now. On Wed, Dec 19, 2018 at 6:58 PM Nicholas Chammas wrote: > Can we somehow disable these new email alerts coming through for the Spark > website repo? > > On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: > >> ueshin commented on a change in pul

Re: [build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread Reynold Xin
Thanks for taking care of this, Shane! On Wed, Dec 19, 2018 at 9:45 AM, shane knapp < skn...@berkeley.edu > wrote: > > master is back up and building. > > On Wed, Dec 19, 2018 at 9:31 AM shane knapp < sknapp@ berkeley. edu ( > skn...@berkeley.edu ) > wrote: > > >> the jenkins process seems to

Re: Decimals with negative scale

2018-12-18 Thread Reynold Xin
@gmail.com > wrote: > > This is at analysis time. > > On Tue, 18 Dec 2018, 17:32 Reynold Xin < rxin@ databricks. com ( > r...@databricks.com ) wrote: > > >> Is this an analysis time thing or a runtime thing? >> >> On Tue, Dec 18, 2018 at 7:45 AM Mar

Re: Decimals with negative scale

2018-12-18 Thread Reynold Xin
Is this an analysis time thing or a runtime thing? On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido wrote: > Hi all, > > as you may remember, there was a design doc to support operations > involving decimals with negative scales. After the discussion in the design > doc, now the related PR is blocked

Re: [DISCUSS] Function plugins

2018-12-14 Thread Reynold Xin
easily consumed by a UDF? > > > > Otherwise +1 for trying to get this to work without Hive. I think even > having something without codegen and optimized row formats is worthwhile if > only because it’s easier to use than Hive UDFs. > > > > -Matt Cheah > > > > *

Re: [DISCUSS] Function plugins

2018-12-14 Thread Reynold Xin
Having a way to register UDFs that are not using Hive APIs would be great! On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue < rb...@netflix.com.invalid > wrote: > > > > Hi everyone, > I’ve been looking into improving how users of our Spark platform register > and use UDFs and I’d like to discuss a f

removing most of the config functions in SQLConf?

2018-12-13 Thread Reynold Xin
In SQLConf, for each config option, we declare them in two places: First in the SQLConf object, e.g.: val CSV_PARSER_COLUMN_PRUNING = buildConf("spark.sql.csv.parser.columnPruning.enabled").internal().doc("If it is set to true, column names of the requested schema are passed to CSV pa
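
The two-places pattern the email describes can be sketched in a few lines of plain Python (a toy model, not Spark's actual Scala code): one declaration of the config entry itself, plus a second, per-config accessor that has to be kept in sync by hand. The proposal is to drop the second declaration.

```python
class ConfigEntry:
    """Toy stand-in for Spark's internal ConfigEntry."""
    def __init__(self, key, default, doc=""):
        self.key, self.default, self.doc = key, default, doc

# Declaration #1: the entry itself.
CSV_PARSER_COLUMN_PRUNING = ConfigEntry(
    "spark.sql.csv.parser.columnPruning.enabled", True,
    "If set to true, column names of the requested schema are "
    "passed to the CSV parser.")

class SQLConf:
    def __init__(self):
        self._settings = {}

    def get(self, entry):
        # Generic lookup: works for every entry without extra code.
        return self._settings.get(entry.key, entry.default)

    # Declaration #2: the redundant per-config accessor the thread
    # proposes removing, since get(entry) already covers it.
    @property
    def csv_column_pruning(self):
        return self.get(CSV_PARSER_COLUMN_PRUNING)

conf = SQLConf()
assert conf.csv_column_pruning is True
assert conf.get(CSV_PARSER_COLUMN_PRUNING) is True
```

Keeping only the generic `get(entry)` path removes one declaration per config and one place for the two to drift apart.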

dsv2 remaining work

2018-12-12 Thread Reynold Xin
Unfortunately I can't make it to the DSv2 sync today. Sending an email with my thoughts instead. I spent a few hours thinking about this. It's evident that progress has been slow, because this is an important API and people from different perspectives have very different requirements, and the pr

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Reynold Xin
A-17385 ( https://issues.apache.org/jira/browse/INFRA-17385 ) but no > follow-up. Go ahead and open a new INFRA ticket. > > On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin < rxin@ databricks. com ( > r...@databricks.com ) > wrote: > > >> Thanks, Sean. Which INFRA ticket is it? It'

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Reynold Xin
Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I want to put some pressure myself there too. On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen < sro...@apache.org > wrote: > > > > Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra > noise. > > > > On

Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Reynold Xin
I used to, before each release during the RC phase, go through every single doc page to make sure we don’t unintentionally leave things public. I no longer have time to do that unfortunately. I find that very useful because I always catch some mistakes through organic development. > On Nov 13,

Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
sed >breaking changes / JIRA tickets? Perhaps we can include it in the JIRA >ticket that can be filtered down to somehow? > > > > Thanks, > > > > -Matt Cheah > > *From: *Vinoo Ganesh > *Date: *Monday, November 12, 2018 at 2:48 PM > *To: *Reynold Xin

Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
PM Vinoo Ganesh wrote: > Quickly following up on this – is there a target date for when Spark 3.0 > may be released and/or a list of the likely api breaks that are > anticipated? > > > > *From: *Xiao Li > *Date: *Saturday, September 29, 2018 at 02:09 > *To: *Reynold

Re: DataSourceV2 capability API

2018-11-09 Thread Reynold Xin
t a concern. When we > add a capability, we add handling for it that old versions wouldn't be able > to use anyway. The advantage is that we don't have to treat all sources the > same. > > On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin wrote: > >> How do we deal with

Re: DataSourceV2 capability API

2018-11-09 Thread Reynold Xin
elix Cheung > wrote: > >> One question is where will the list of capability strings be defined? >> >> >> -- >> *From:* Ryan Blue >> *Sent:* Thursday, November 8, 2018 2:09 PM >> *To:* Reynold Xin >> *Cc:* Spark D

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Reynold Xin
Do you have a cached copy? I see it here http://spark.apache.org/downloads.html On Thu, Nov 8, 2018 at 4:12 PM Li Gao wrote: > this is wonderful ! > I noticed the official spark download site does not have 2.4 download > links yet. > > On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde wrote: > >>

Re: DataSourceV2 capability API

2018-11-08 Thread Reynold Xin
This is currently accomplished by having traits that data sources can extend, as well as runtime exceptions right? It's hard to argue one way vs another without knowing how things will evolve (e.g. how many different capabilities there will be). On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue wrote:

Did the 2.4 release email go out?

2018-11-08 Thread Reynold Xin
The website is already up but I didn’t see any email announcement yet.

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Reynold Xin
Have we deprecated Scala 2.11 already in an existing release? On Tue, Nov 6, 2018 at 4:43 PM DB Tsai wrote: > Ideally, we would support only Scala 2.12 in Spark 3. > > DB Tsai | Siri Open Source Technologies [not a contribution] |  > Apple, Inc > > > On Nov 6, 2018, at 2:55 PM, Feli

Re: Test and support only LTS JDK release?

2018-11-06 Thread Reynold Xin
What does OpenJDK do and other non-Oracle VMs? I know there was a lot of discussion from Red Hat etc. about support. On Tue, Nov 6, 2018 at 11:24 AM DB Tsai wrote: > Given Oracle's new 6-month release model, I feel the only realistic option > is to only test and support JDK such as JDK 11 LTS and f

Re: Removing non-deprecated R methods that were deprecated in Python, Scala?

2018-11-06 Thread Reynold Xin
Maybe deprecate and remove in next version? It is bad to just remove a method without deprecation notice. On Tue, Nov 6, 2018 at 5:44 AM Sean Owen wrote: > See https://github.com/apache/spark/pull/22921#discussion_r230568058 > > Methods like toDegrees, toRadians, approxCountDistinct were 'rename

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-31 Thread Reynold Xin
+1 Look forward to the release! On Mon, Oct 29, 2018 at 3:22 AM Wenchen Fan wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.0. > > The vote is open until November 1 PST and passes if a majority +1 PMC > votes are cast, with > a minimum of 3 +1 votes. >

Re: Helper methods for PySpark discussion

2018-10-28 Thread Reynold Xin
I agree - it is very easy for users to shoot themselves in the foot if we don't put in the safeguards, or mislead them by giving them the impression that operations are cheap. DataFrame in Spark isn't like a single node in-memory data structure. Note that the repr string work is very different. Th

Re: Drop support for old Hive in Spark 3.0?

2018-10-26 Thread Reynold Xin
People do use it, and the maintenance cost is pretty low so I don't think we should just drop it. We can be explicit about there are not a lot of developments going on and we are unlikely to add a lot of new features to it, and users are also welcome to use other JDBC/ODBC endpoint implementations

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Reynold Xin
I also think we should get this in: https://github.com/apache/spark/pull/22841 It's to deprecate a confusing & broken window function API, so we can remove them in 3.0 and redesign a better one. See https://issues.apache.org/jira/browse/SPARK-25841 for more information. On Thu, Oct 25, 2018 at 4

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Reynold Xin
+1 On Thu, Oct 25, 2018 at 4:12 PM Li Jin wrote: > Although I am not specifically involved in DSv2, I think having this kind > of meeting is definitely helpful to discuss, move certain effort forward > and keep people on the same page. Glad to see this kind of working group > happening. > > On

Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-25 Thread Reynold Xin
I have some pretty serious concerns over this proposal. I agree that there are many things that can be improved, but at the same time I also think the cost of introducing a new IR in the middle is extremely high. Having participated in designing some of the IRs in other systems, I've seen more fail

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-10-25 Thread Reynold Xin
e could argue that the litany of the questions are really a >> double-click on the essence: why, what, how. The three interrogatives ought >> to be the essence and distillation of any proposal or technical exposition. >> >> Cheers >> Jules >> >> Sent from

Re: some doubt on code understanding

2018-10-17 Thread Reynold Xin
Rounding. On Wed, Oct 17, 2018 at 6:25 PM Sandeep Katta < sandeep0102.opensou...@gmail.com> wrote: > Hi Guys, > > I am trying to understand structured streaming code flow by doing so I > came across below code flow > > def nextBatchTime(now: Long): Long = { > if (intervalMs == 0) now else now /
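
The truncated snippet above is doing interval rounding. A hedged reconstruction in Python (assuming the usual floor-division semantics for non-negative values; the quoted Scala is cut off mid-expression, so this is an illustration rather than the verbatim source) shows why the answer to the question is "rounding": the expression floors `now` to the start of the current interval, then adds one interval.

```python
def next_batch_time(now_ms, interval_ms):
    # Sketch of the rounding discussed in the thread: floor `now` to
    # the current interval boundary, then step to the next boundary,
    # so the result is strictly after `now`.
    if interval_ms == 0:
        return now_ms
    return now_ms // interval_ms * interval_ms + interval_ms

assert next_batch_time(1003, 100) == 1100
assert next_batch_time(1000, 100) == 1100  # exact boundary -> next one
assert next_batch_time(42, 0) == 42        # zero interval: fire now
```

The integer division deliberately discards the sub-interval remainder, which is the "rounding" the reply refers to.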

Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Reynold Xin
We shouldn’t merge new features into release branches anymore. On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse wrote: > Right now the Kerberos support for Spark on K8S is only on master AFAICT > i.e. the feature is not present on branch-2.4 > > > > Therefore I don’t see any point in adding the tests i

Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Reynold Xin
Sounds like a good idea... > On Oct 11, 2018, at 6:40 PM, Sean Owen wrote: > > Yep, that already exists as Bahir. > Also, would anyone object to declaring Flume support at least > deprecated in 2.4.0? >> On Wed, Oct 10, 2018 at 2:29 PM Jörn Franke wrote: >> >> I think it makes sense to remove

Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
the seed value and we add >> the seed name in the test case name. This can help us reproduce it. >> >> Xiao >> >> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin wrote: >> >>> I'm personally not a big fan of doing it that way in the PR. It is >>>

Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
I'm personally not a big fan of doing it that way in the PR. It is perfectly fine to employ randomized tests, and in this case it might even be fine to just pick couple different timezones like the way it happened in the PR, but we should: 1. Document in the code comment why we did it that way. 2
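
A minimal sketch of the pattern being advocated: randomized test inputs are fine as long as the choice is driven by an explicit, logged seed, so any failure can be reproduced exactly. This is plain Python; the timezone list and helper name are illustrative.

```python
import random

def pick_test_timezones(seed, k=2):
    # Derive the randomized choice of inputs from an explicit seed
    # rather than global random state.
    zones = ["UTC", "America/Los_Angeles", "Asia/Seoul",
             "Europe/Paris", "Pacific/Kiritimati"]
    rng = random.Random(seed)
    return rng.sample(zones, k)

# In a real suite the seed would be generated per run and surfaced in
# the test name or failure message, e.g.:
#   seed = random.randrange(2**31); print(f"timezone test seed={seed}")
seed = 12345
chosen = pick_test_timezones(seed)

# Same seed -> same sample, so a reported failure is reproducible.
assert chosen == pick_test_timezones(seed)
assert len(chosen) == 2
```

This covers both points in the email: the comment documents why the inputs are randomized, and the logged seed makes the run reproducible.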

Re: Back to SQL

2018-10-03 Thread Reynold Xin
No we used to have that (for views) but it wasn’t working well enough so we removed it. On Wed, Oct 3, 2018 at 6:41 PM Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > Is there any known way to go from a Spark SQL Logical Plan (optimised ?) > Back to a SQL query ? > > R

welcome a new batch of committers

2018-10-03 Thread Reynold Xin
Hi all, The Apache Spark PMC has recently voted to add several new committers to the project, for their contributions: - Shane Knapp (contributor to infra) - Dongjoon Hyun (contributor to ORC support and other parts of Spark) - Kazuaki Ishizaki (contributor to Spark SQL) - Xingbo Jiang (contribut

Re: time for Apache Spark 3.0?

2018-09-28 Thread Reynold Xin
getting >> everything right before we see the results of the new API being more widely >> used, and too much cost in maintaining until the next major release >> something that we come to regret for us to create new API in a fully frozen >> state. >> > >&

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-27 Thread Reynold Xin
Thoughts on how the api would look like? On Thu, Sep 27, 2018 at 11:13 AM Russell Spitzer wrote: > While that's easy for some users, we basically want to load up some > functions by default into all session catalogues regardless of who made > them. We do this with certain rules and strategies us

Re: Support for Second level of concurrency

2018-09-25 Thread Reynold Xin
That’s a pretty major architectural change and would be extremely difficult to do at this stage. On Tue, Sep 25, 2018 at 9:31 AM sandeep mehandru wrote: > Hi Folks, > >There is a use-case, where we are doing large computation on two large > vectors. It is basically a scenario, where we run


Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Reynold Xin
We also only block if it is a new regression. On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao wrote: > Hi Marco, > > From my understanding of SPARK-25454, I don't think it is a blocker issue, > it might be a corner case, so personally I don't want to block the release > of 2.3.2 because of this issu

Re: [DISCUSS] upper/lower of special characters

2018-09-18 Thread Reynold Xin
I'd just document it as a known limitation and move on for now, until there are enough end users that need this. Spark is also very powerful with UDFs and end users can easily work around this using UDFs. -- excuse the brevity and lower case due to wrist injury On Tue, Sep 18, 2018 at 11:14 PM s

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Reynold Xin
i'd like to second that. if we want to communicate timeline, we can add to the release notes saying py2 will be deprecated in 3.0, and removed in a 3.x release. -- excuse the brevity and lower case due to wrist injury On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia wrote: > That’s a good point

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Reynold Xin
Most of those are pretty difficult to add though, because they are fundamentally difficult to do in a distributed setting and with lazy execution. We should add some but at some point there are fundamental differences between the underlying execution engine that are pretty difficult to reconcile.

Re: from_csv

2018-09-15 Thread Reynold Xin
makes sense - i'd make this as consistent as to_json / from_json as possible. how would this work in sql? i.e. how would passing options in work? -- excuse the brevity and lower case due to wrist injury On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk wrote: > Hi All, > > I would like to propose ne

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Reynold Xin
we can also declare python 2 as deprecated and drop it in 3.x, not necessarily 3.0. -- excuse the brevity and lower case due to wrist injury On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson wrote: > I am probably splitting hairs to finely, but I was considering the > difference between improvem

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
t be going to be duplicated. > > Ryan replied me as Iceberg and HBase MVCC timestamps can enable us to > implement "commit" (his reply didn't hit dev. mailing list though) but I'm > not an expert of both twos and I couldn't still imagine it can deal with > v
