Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-19 Thread Wenchen Fan
I think so. I don't see other bug reports for 2.4. On Thu, Aug 20, 2020 at 12:11 AM Nicholas Marion wrote: > It appears all 3 issues slated for Spark 2.4.7 have been merged. Should we > be looking at getting RC2 ready? > > > Regards, > > *NICHOLAS T. MARION * > IBM Open Data Analytics for z/OS -

Re: [SparkSql] Casting of Predicate Literals

2020-08-19 Thread Wenchen Fan
. > CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000" > without the cast? > > On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer > wrote: > >> Thanks! That's exactly what I was hoping for! Thanks for finding the Jira >> for m

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread Wenchen Fan
ly update view schema (even >> though executing the view in Hive results in data that has the most recent >> schema when underlying tables evolve -- so newly added nested field data >> shows up in the view evaluation query result but not in the view schema). >> >>

Re: SPIP: Catalog API for view metadata

2020-08-14 Thread Wenchen Fan
>>"dual" catalog. >>>>>- The implementation for a "dual" catalog plugin should ensure: >>>>> - Creating a view in view catalog when a table of the same >>>>> name exists should fail. >>>>> - Creati

Re: SPIP: Catalog API for view metadata

2020-08-12 Thread Wenchen Fan
Hi John, Thanks for working on this! View support is very important to the catalog plugin API. After reading your doc, I have one high-level question: should view be a separate API, or is it just a special type of table? AFAIK in most databases, tables and views share the same namespace. You can'

Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Wenchen Fan
I think this is not a problem in 3.0 anymore, see https://issues.apache.org/jira/browse/SPARK-27638 On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer wrote: > I've just run into this issue again with another user and I feel like most > folks here have seen some flavor of this at some point. > > Th
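A minimal sketch of the pushdown problem discussed in this thread, assuming a Parquet table with a SMALLINT column named short_col (the path and column are made up, not from the original mails):

    import org.apache.spark.sql.SparkSession

    object PredicateCastDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("predicate-cast-demo").getOrCreate()

        // The literal 1000 is typed as INT, so older releases wrapped the column
        // instead: CAST(short_col AS INT) < 1000, and a cast around the column
        // blocks filter pushdown to the Parquet reader.
        val df = spark.read.parquet("/path/to/events").filter("short_col < 1000")

        // With the 3.0-era fix the predicate should appear under PushedFilters
        // in the file scan node of the physical plan.
        df.explain(true)
      }
    }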

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Wenchen Fan
+1, thanks for driving it, Holden! On Fri, Jul 31, 2020 at 10:24 AM Holden Karau wrote: > +1 from myself :) > > On Thu, Jul 30, 2020 at 2:53 PM Jungtaek Lim > wrote: > >> +1 (non-binding, I guess) >> >> Thanks for raising the issue and sorting it out! >> >> On Fri, Jul 31, 2020 at 6:47 AM Holde

Re: InterpretedUnsafeProjection - error in getElementSize

2020-07-24 Thread Wenchen Fan
Can you create a JIRA ticket? There are many people happy to help to fix it. On Tue, Jul 21, 2020 at 9:49 PM Janda Martin wrote: > Hi, > I think that I found error in > InterpretedUnsafeProjection::getElementSize. This method differs from > similar implementation in GenerateUnsafeProjection. >

Re: Catalog API for Partition

2020-07-20 Thread Wenchen Fan
Yea we don't want the partitions to be Hive-specific. My point is, we call it "Partition Catalog APIs", which leaves me confused about the relationship between this and the "partitions" in `TableCatalog.createTable`. Are these two orthogonal? Or do you kind of unify them? On Sat, Jul 18, 2020 at 12:02

Re: Catalog API for Partition

2020-07-17 Thread Wenchen Fan
In Hive, partition does two things: 1. Act as an index to speed up data scan 2. Act as a way to manage the data. People can add/drop partitions. How do you unify these 2 things in your API design? On Fri, Jul 17, 2020 at 12:03 AM JackyLee wrote: > Hi devs, > > In order to support Partition Comm

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Wenchen Fan
It looks like there are two topics: 1. PRs with -1 2. PRs with someone asking to wait for certain days. Holden, it seems you are hitting 2? I think 2 can be problematic if there are people who keep asking to wait, and block the PR indefinitely. But if it's only asked once, this seems OK. BTW, sinc

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Wenchen Fan
is was not done for Spark 2.4.6 because it was too late on the vote > process but it makes perfect sense to have this in 2.4.7. > > On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan wrote: > > > > Yea I think 2.4.7 is good to go. Let's start! > > > > On Wed, Jul 15, 202

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Wenchen Fan
Yea I think 2.4.7 is good to go. Let's start! On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma wrote: > Hi Folks, > > So, I am back, and searched the JIRAS with target version as "2.4.7" and > Resolved, found only 2 jiras. So, are we good to go, with just a couple of > jiras fixed ? Shall I proce

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Wenchen Fan
Congrats and welcome! On Wed, Jul 15, 2020 at 2:18 PM Mridul Muralidharan wrote: > > Congratulations ! > > Regards, > Mridul > > On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia > wrote: > >> Hi all, >> >> The Spark PMC recently voted to add several new committers. Please join >> me in welcoming

Re: [PSA] Apache Spark uses GitHub Actions to run the tests

2020-07-14 Thread Wenchen Fan
To clarify, we need to wait for: 1. the Java documentation build test in GitHub Actions, 2. the dependency test in GitHub Actions, 3. either all GitHub Actions checks green or a Jenkins pass. If the PR touches Kinesis, or it uses other profiles, we must wait for Jenkins to pass. Did I miss something? On Tue, Jul 14, 2

Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread Wenchen Fan
+1 On Fri, Jul 3, 2020 at 12:06 AM DB Tsai wrote: > +1 > > On Thu, Jul 2, 2020 at 8:59 AM Ryan Blue > wrote: > >> +1 >> >> On Thu, Jul 2, 2020 at 8:00 AM Dongjoon Hyun >> wrote: >> >>> +1. >>> >>> Thank you, Holden. >>> >>> Bests, >>> Dongjoon. >>> >>> On Thu, Jul 2, 2020 at 6:43 AM wuyi wrot

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-30 Thread Wenchen Fan
Hi Jason, Thanks for reporting! https://issues.apache.org/jira/browse/SPARK-32136 looks like a breaking change and we should investigate. On Wed, Jul 1, 2020 at 11:31 AM Holden Karau wrote: > I can take care of 2.4.7 unless someone else wants to do it. > > On Tue, Jun 30, 2020 at 8:29 PM Jason

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version. On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote: > To Xiao. > Why Apache project releases should be blocked

Re: Datasource with ColumnBatchScan support.

2020-06-17 Thread Wenchen Fan
If you already have your own `FileFormat` implementation: just override the `supportBatch` method. On Tue, Jun 16, 2020 at 5:39 AM Nasrulla Khan Haris wrote: > HI Spark developers, > > > > FileSourceScanExec >
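A hedged sketch of what overriding that method could look like; FileFormat is an internal API, so the trait name here is made up and the exact signature should be checked against your Spark version:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.FileFormat
    import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

    trait MyColumnarFormat extends FileFormat {
      // Advertise columnar (ColumnarBatch) reads only for flat schemas of
      // simple types; anything nested falls back to row-at-a-time reading.
      override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
        schema.fields.forall { f =>
          f.dataType match {
            case _: StructType | _: ArrayType | _: MapType => false
            case _ => true
          }
        }
      }
    }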

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
or changes, neither of these > accomplishes that. That's valuable, but is what a summary blog is for. > > I can't feel strongly about this, so, would just say, propose process > changes for 3.1 and codify in the contributing guide but stick with what we > have for 3.0. > &

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
oing to get included in release notes. >> They aren't anywhere then (3.0 is done, so not the migration guide). Some >> are important. >> Change could be OK but how about proposing this going forward? >> >> >> On Wed, Jun 10, 2020 at 10:35 AM Wenchen Fan

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
My 2 cents: Since we have a migration guide, I think people who hit problems when upgrading Spark will read it. We should mention all the breaking changes there, except for trivial ones like obvious bug fixes. Even if there is no meaningful migration to guide for things like removing a deprecated

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Wenchen Fan
+1 (binding) On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao wrote: > +1 (non-binding) > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.or

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-05-31 Thread Wenchen Fan
+1 (binding), although I don't know why we jump from RC 3 to RC 8... On Mon, Jun 1, 2020 at 7:47 AM Holden Karau wrote: > Please vote on releasing the following candidate as Apache Spark > version 2.4.6. > > The vote is open until June 5th at 9AM PST and passes if a majority +1 PMC > votes are c

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Wenchen Fan
Seems the priority of SPARK-31706 is incorrectly marked, and it's a blocker now. The fix was merged just a few hours ago. This should be a -1 for RC2. On Wed, May 20, 2020 at 2:42 PM rickestcode wrote: > +1 > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > -

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Wenchen Fan
+1, no known blockers. On Mon, May 18, 2020 at 12:49 AM DB Tsai wrote: > +1 as well. Thanks. > > On Sun, May 17, 2020 at 7:39 AM Sean Owen wrote: > >> +1 , same response as to the last RC. >> This looks like it includes the fix discussed last time, as well as a >> few more small good fixes. >>

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-13 Thread Wenchen Fan
I think we already have this table capability: ACCEPT_ANY_SCHEMA. Can you try that? On Thu, May 14, 2020 at 6:17 AM Russell Spitzer wrote: > I would really appreciate that, I'm probably going to just write a planner > rule for now which matches up my table schema with the query output if they > ar
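A minimal sketch of a DS v2 table (Spark 3.0 connector API) declaring that capability; the class name is made up, and a real writable table would also mix in SupportsWrite:

    import java.util
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    class LenientTable(tableName: String, tableSchema: StructType) extends Table {
      override def name(): String = tableName
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(
          TableCapability.BATCH_WRITE,
          // Skip the analyzer's strict schema check on writes to this table.
          TableCapability.ACCEPT_ANY_SCHEMA)
    }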

Re: [Datasource V2] Exception Handling for Catalogs - Naming Suggestions

2020-05-13 Thread Wenchen Fan
This looks a bit specific and maybe it's better to allow catalogs to customize the error message, which is more general. On Wed, May 13, 2020 at 12:16 AM Russell Spitzer wrote: > Currently the way some actions work, we receive an error during analysis > phase. For example, doing a "SELECT * FROM

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Wenchen Fan
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests. I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-10 Thread Wenchen Fan
t; |18995|243603134985| > |18991|476309451025| > |18993|287916490001| > |18998|324427845137| > |18992|412640801297| > |18994|302012976401| > +-++ > ... > > This can happen with such inconsistent schemas because State in Structured > Streaming doesn'

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-08 Thread Wenchen Fan
xed (though the consideration of severity seems to be >>>> different), and once we notice the issue it would be really odd if we >>>> publish it as it is, and try to fix it later (the fix may not be even >>>> included in 3.0.x as it might bring behavioral c

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-08 Thread Wenchen Fan
Can you give some simple examples to demonstrate the problem? I think the inconsistency would bring problems but don't know how. On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim wrote: > (bump to expose the discussion to more readers) > > On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim > wrote: > >> Hi

Re: is there any tool to visualize the spark physical plan or spark plan

2020-04-30 Thread Wenchen Fan
Does the Spark SQL web UI work for you? https://spark.apache.org/docs/3.0.0-preview/web-ui.html#sql-tab On Thu, Apr 30, 2020 at 5:30 PM Manu Zhang wrote: > Hi Kelly, > > If you can parse event log, then try listening on > `SparkListenerSQLExecutionStart` event and build a `SparkPlanGraph` like >

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Wenchen Fan
IIUC We are moving away from having 2 classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use with a single class. I don't have a strong preference over option 3 or 4. We may need to collect more data points from actual users. On Mon, Apr 27, 2020 at 9:50 PM Hyukji

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Wenchen Fan
ards,Dhrubajyoti Hati.* > > > On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > >> This looks like a bug that path filter doesn't work for hive table >> reading. Can you open a JIRA ticket? >> >> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati >&g

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Wenchen Fan
This looks like a bug that path filter doesn't work for hive table reading. Can you open a JIRA ticket? On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati wrote: > Just wondering if any one could help me out on this. > > Thank you! > > > > > *Regards,Dhrubajyoti Hati.* > > > On Wed, Apr 22, 2020 a

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Wenchen Fan
The ongoing critical issues I'm aware of are: SPARK-31257: Fix ambiguous two different CREATE TABLE syntaxes; SPARK-31404: backward compatibility issues after switching to Proleptic Gregorian cale

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Wenchen Fan
t; Hello > > On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote: > >> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not >> sure this is possible as the DS V2 API is very different in 3.0, e.g. there >> is no `DataSourceV2` anymore, and you should im

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Wenchen Fan
Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible as the DS V2 API is very different in 3.0, e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` (if you don't have database/table). On Wed, Apr 8, 2020 at 6:58 AM Andrew Mel
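A rough sketch of the 3.0-style entry point mentioned above; the provider and table shown are illustrative only, and the exact TableProvider signatures should be checked against the 3.0 release you target:

    import java.util
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class MySourceProvider extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType().add("id", "long").add("value", "string")

      override def getTable(
          tableSchema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table = new Table {
        override def name(): String = "my_source"
        override def schema(): StructType = tableSchema
        // A real source would also mix in SupportsRead and return a ScanBuilder.
        override def capabilities(): util.Set[TableCapability] =
          util.EnumSet.of(TableCapability.BATCH_READ)
      }
    }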

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Wenchen Fan
Yea, release candidates are different from the preview version: release candidates are not official releases, so they won't appear in Maven Central, can't be downloaded from the Spark official website, etc. On Wed, Apr 1, 2020 at 12:32 PM Sean Owen wrote: > These are release candidates, not the

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-29 Thread Wenchen Fan
I agree that we can cut the RC anyway even if there are blockers, to move us to a more official "code freeze" status. About the CREATE TABLE unification, it's still WIP and not close-to-merge yet. Can we fix some specific problems like CREATE EXTERNAL TABLE surgically and leave the unification to

Re: Programmatic: parquet file corruption error

2020-03-27 Thread Wenchen Fan
Running a Spark application from an IDE is not officially supported. It may work in some cases but there is no guarantee at all. The official way is to run interactive queries with spark-shell or package your application into a jar and use spark-submit. On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
g from maven. > > Backbutton.co.uk > ¯\_(ツ)_/¯ > ♡۶Java♡۶RMI ♡۶ > Make Use Method {MUM} > makeuse.org > <http://www.backbutton.co.uk> > > > On Fri, 27 Mar 2020 at 05:45, Wenchen Fan wrote: > >> Which Spark/Scala version do you use? >> >> On

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use? On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman wrote: > > with the following sparksession configuration > > val spark = SparkSession.builder().master("local[*]").appName("Spark Session > take").getOrCreate(); > > this line works > > flights.filter(flight_row

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-23 Thread Wenchen Fan
Hi Ryan, It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with. Thanks, Wenchen On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas wrote: > On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > >>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Wenchen Fan
> Hive create table syntax, or just use beeline with Hive connected. >>>> >>>> But since we are even thinking about native syntax as a first class and >>>> dropping Hive one implicitly (hide in doc) or explicitly, does it really >>>> matter we re

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
'm not sure how we can >> only improve documentation to make things be clear, but if the approach >> would be explaining the difference of rules and guide the tip to make the >> query be bound to the specific rule, the same could be applied to parser >> rule to address t

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
orders when we explain in the doc) > by themselves to understand which provider the table will leverage? I'm > sorry but I think we are making bad assumption on end users which is a > serious problem. > > If we really want to promote Spark's one for CREATE TABLE, then would it &g

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive because the native syntax required the USING clause while the Hive syntax made the ROW FORMAT or STORED AS clause optional. It's a good move to make t
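To make the distinction concrete, here are the two forms side by side (table names are made up; run from spark-shell or anywhere a SparkSession named spark is in scope):

    // Native Spark syntax: the USING clause picks the data source.
    spark.sql("""
      CREATE TABLE native_tbl (id BIGINT, data STRING)
      USING parquet
    """)

    // Hive syntax: ROW FORMAT / STORED AS clauses pick the serde.
    spark.sql("""
      CREATE TABLE hive_tbl (id BIGINT, data STRING)
      STORED AS PARQUET
    """)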

Re: Spark 2.4.x and 3.x datasourcev2 api documentation & references

2020-03-18 Thread Wenchen Fan
For now you can take a look at `DataSourceV2Suite`, which contains both Java/Scala implementations. There is also an ongoing PR to implement catalog APIs for JDBC: https://github.com/apache/spark/pull/27345 We are still working on the user guide. On Mon, Mar 16, 2020 at 4:59 AM MadDoxX wrote: >

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
OK let me put a proposal here: 1. Permanently ban CHAR for native data source tables, and only keep it for Hive compatibility. It's OK to forget about padding like what Snowflake and MySQL have done. But it's hard for Spark to require consistent behavior about CHAR type in all data sources. Since

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
I agree that Spark can define the semantic of CHAR(x) differently than the SQL standard (no padding), and ask the data sources to follow it. But the problem is, some data sources may not be able to skip padding, like the Hive serde table. On the other hand, it's easier to require padding for CHAR(

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-16 Thread Wenchen Fan
I don't think option 1 is possible. For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns. For option 3: It's a bit risky to add a new API, but it seems like we have a good reason. The un

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Wenchen Fan
+1 (binding), assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc. On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía wrote: > +1 (non-binding) > > Michael's section on the trade-offs of maintaining / removing an API are > one of > the best reads I have see

Re: Datasource V2 support in Spark 3.x

2020-03-05 Thread Wenchen Fan
Data Source V2 has evolved to Connector API which supports both data (the data source API) and metadata (the catalog API). The new APIs are under package org.apache.spark.sql.connector You can keep using Data Source V1 as there is no plan to deprecate it in the near future. But if you'd like to t

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Wenchen Fan
The JIRA ticket will show the linked PR if there are any, which indicates that someone is working on it if the PR is active. Maybe the bot should also leave a comment on the JIRA ticket to make it clearer? On Fri, Feb 21, 2020 at 10:54 PM younggyu Chun wrote: > Hi All, > > I would like to sugges

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-18 Thread Wenchen Fan
2. foldable srcStr + non-foldable trimStr >>> 3. non-foldable srcStr + foldable trimStr >>> 4. non-foldable srcStr + non-foldable trimStr >>> >>> The case # 2 seems a rare case, and # 3 is probably the most common >>> case. Once we see the second cas

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-15 Thread Wenchen Fan
It's unfortunate that we don't have a clear document to talk about breaking changes (I'm working on it BTW). I believe the general guidance is: *avoid breaking changes unless we have to*. For example, the previous result was so broken that we have to fix it, moving to Scala 2.12 makes us have to br

Re: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Wenchen Fan
Thanks for providing the benchmark numbers! The result is very promising and I'm looking forward to seeing more feedback from real-world workloads. On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A wrote: > Hi all, > > We have completed the Spark 3.0 Adaptive Query Execution(AQE) performance > tests in

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
>>> >>>> The new policy looks clear to me. +1 for the explicit policy. >>>> >>>> So, are we going to revise the existing conf names before 3.0.0 release? >>>> >>>> Or, is it applied to new up-coming configurations from now? >>>

[DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
Hi all, I'd like to discuss the naming policy of Spark configs, as for now it depends on personal preference which leads to inconsistent namings. In general, the config name should be a noun that describes its meaning clearly. Good examples: spark.sql.session.timeZone spark.sql.streaming.continuo

Re: Request to document the direct relationship between other configurations

2020-02-12 Thread Wenchen Fan
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to documenting the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Wenchen Fan
What's your use case to compare intervals? It's tricky in Spark as there is only one interval type and you can't really compare one month with 30 days. On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack wrote: > Hi Devs, > > I would like to know what is the current roadmap of making > CalendarInterv

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Wenchen Fan
Great job, Dongjoon! On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon wrote: > Thanks Dongjoon! > > On Sun, Feb 9, 2020 at 10:49 AM, Takeshi Yamamuro wrote: > >> Happy to hear the release news! >> >> Bests, >> Takeshi >> >> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun >> wrote: >> >>> There was a typo in

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-02-05 Thread Wenchen Fan
This is a hack really and we don't recommend users to access internal classes directly. That's why there is no public document. If you really need to do it and are aware of the risks, you can read the source code. All expressions (or the so-called "native UDF") extend the base class `Expression`.
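As a hedged illustration only (these are internal Catalyst classes that can change between releases, and this example is not from the original mail), a "native UDF" is roughly an Expression subclass like the following:

    import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
    import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
    import org.apache.spark.sql.types.{DataType, StringType}
    import org.apache.spark.unsafe.types.UTF8String

    // CodegenFallback avoids writing code generation by hand; built-in
    // expressions usually implement doGenCode for better performance.
    case class MyUpper(child: Expression) extends UnaryExpression with CodegenFallback {
      override def dataType: DataType = StringType

      // Called only for non-null input; null handling comes from UnaryExpression.
      override protected def nullSafeEval(input: Any): Any =
        input.asInstanceOf[UTF8String].toUpperCase
    }

Wiring such an expression into a query (for example by wrapping it in a Column or registering it in the function registry) also goes through internal APIs, which is exactly the risk described above.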

Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-03 Thread Wenchen Fan
AFAIK there is no ongoing critical bug fixes, +1 On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun wrote: > Yes, it does officially since 2.4.0. > > 2.4.5 is a maintenance release of 2.4.x line and the community didn't > support Hadoop 3.x on 'branch-2.4'. We didn't run test at all. > > Bests, > Don

Re: [FYI] `Target Version` on `Improvement`/`New Feature` JIRA issues

2020-02-02 Thread Wenchen Fan
Thanks for cleaning this up! On Sun, Feb 2, 2020 at 2:08 PM Xiao Li wrote: > Thanks! Dongjoon. > > Xiao > > On Sat, Feb 1, 2020 at 5:15 PM Hyukjin Kwon wrote: > >> Thanks Dongjoon. >> >> On Sun, 2 Feb 2020, 09:08 Dongjoon Hyun, wrote: >> >>> Hi, All. >>> >>> From Today, we have `branch-3.0` as

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
een made to >>>> this API to address the main concerns mentioned. >>>> Also, the followup JIRA requested seems still open >>>> https://issues.apache.org/jira/browse/SPARK-27386 >>>> I heard this was already discussed but I cannot find the summary of the

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread Wenchen Fan
The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of th

Re: Correctness and data loss issues

2020-01-21 Thread Wenchen Fan
I think we need to go through them during the 3.0 QA period, and try to fix the valid ones. For example, the first ticket should be fixed already in https://issues.apache.org/jira/browse/SPARK-28344 On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun wrote: > Hi, All. > > According to our policy, "Co

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Wenchen Fan
I think there are a few details we need to discuss. How frequently should a source update its metrics? For example, if the file source needs to report size metrics per row, it'll be super slow. What metrics should a source report? Data size? numFiles? Read time? Shall we show metrics in the SQL web UI a

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Wenchen Fan
The proposal makes sense to me. If we are not going to make interval type ANSI-compliant in this release, we should not expose it widely. Thanks for driving it, Kent! On Fri, Jan 17, 2020 at 10:52 AM Dr. Kent Yao wrote: > Following ANSI might be a good option but also a serious user behavior >

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving, so half-baked features are inevitable sometimes. This feature is definitely in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss. About expression duplication: this is an existing design choice. We don't wan

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-15 Thread Wenchen Fan
Recently we merged several fixes to 2.4: https://issues.apache.org/jira/browse/SPARK-30325 a driver hang issue; https://issues.apache.org/jira/browse/SPARK-30246 a memory leak issue; https://issues.apache.org/jira/browse/SPARK-29708 a correctness issue (for a rarely used feature, so not merged t

Re: Question about Datasource V2

2020-01-13 Thread Wenchen Fan
1. We plan to add view support in future releases. 2. Can you open a JIRA ticket? This seems like a bug to me. 3. Instead of defining a lot of fields in the table, we decided to use properties to keep all the extra information. We've defined some reserved properties like "comment", "location", which

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
s. > > Iacovos > On 1/9/20 5:03 PM, Wenchen Fan wrote: > > RDD has a flag `storageLevel` which will be set by calling persist(). RDD > will be serialized and sent to executors for running tasks. So executors > just look at RDD.storageLevel and store output in its block manag

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
RDD has a flag `storageLevel` which will be set by calling persist(). RDD will be serialized and sent to executors for running tasks. So executors just look at RDD.storageLevel and store output in its block manager when needed. On Thu, Jan 9, 2020 at 5:53 PM Jack Kolokasis wrote: > Hello all, >
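A small illustration of that flow with the public RDD API (the local-mode settings and numbers are just for the example):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)

    rdd.persist(StorageLevel.MEMORY_ONLY) // only sets the storageLevel flag on the driver
    println(rdd.getStorageLevel)          // shows the level that will ship with the RDD

    rdd.count() // first action: executors see a non-NONE level and cache the partitions
    rdd.count() // second action: partitions are served from the executors' block managers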

Re: [SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Wenchen Fan
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`. On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack wrote: > Hi Devs, > > I'd like to propose a stricter version of as[T]. Given the interface def > as[T](): Dataset[T], it is counter-intuitiv
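A small repro of that point, assuming a case class narrower than the DataFrame schema (names are made up):

    import org.apache.spark.sql.SparkSession

    case class NameOnly(name: String)

    val spark = SparkSession.builder().master("local[*]").appName("as-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("name", "extra")

    df.as[NameOnly].printSchema()               // still shows both columns: as[T] is lazy
    df.as[NameOnly].map(identity).printSchema() // shows only `name`, the encoder's schema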

Re: [DISCUSS] Support subdirectories when accessing partitioned Parquet Hive table

2020-01-06 Thread Wenchen Fan
Isn't your directory structure malformed? The directory name under the table path should be in the form of "partitionCol=value". And AFAIK this is the Hive standard. On Mon, Jan 6, 2020 at 6:59 PM Lotkowski, Michael wrote: > Hi all, > > > > Reviving this thread, we still have this issue and we
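For reference, the partitionCol=value layout mentioned above looks like this (paths are made up); subdirectories that don't follow that pattern are not recognized as partitions:

    /warehouse/sales/            <- table location
      year=2019/month=12/part-00000.parquet
      year=2020/month=01/part-00000.parquet
      year=2020/month=02/part-00000.parquet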

Re: Release Apache Spark 2.4.5

2020-01-05 Thread Wenchen Fan
+1 On Mon, Jan 6, 2020 at 12:02 PM Jungtaek Lim wrote: > +1 to have another Spark 2.4 release, as Spark 2.4.4 was released in 4 > months old and there's release window for this. > > On Mon, Jan 6, 2020 at 12:38 PM Hyukjin Kwon wrote: > >> Yeah, I think it's nice to have another maintenance rele

Re: Fw:Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-29 Thread Wenchen Fan
+1 for the new thrift server to get rid of the Hive dependencies! On Mon, Dec 23, 2019 at 7:55 PM Yuming Wang wrote: > I'm +1 for this SPIP for these two reasons: > > 1. The current thriftserver has some issues that are not easy to solve, > such as: SPARK-28636

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Wenchen Fan
Sounds good! On Tue, Dec 24, 2019 at 7:48 AM Reynold Xin wrote: > We've pushed out 3.0 multiple times. The latest release window documented > on the website says we'd > code freeze and cut branch-3.0 early Dec. It looks like we are suffering a > b

Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Wenchen Fan
+1, all tests pass On Thu, Dec 19, 2019 at 7:18 AM Takeshi Yamamuro wrote: > Thanks, Yuming! > > I checked the links and the prepared binaries. > Also, I run tests with -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver > -Pmesos -Pkubernetes -Psparkr > on java version "1.8.0_181. > All the things

Re: how to get partition column info in Data Source V2 writer

2019-12-18 Thread Wenchen Fan
Hi Aakash, You can try the latest DS v2 with the 3.0 preview, and the API is in a quite stable shape now. With the latest API, a Writer is created from a Table, and the Table has the partitioning information. Thanks, Wenchen On Wed, Dec 18, 2019 at 3:22 AM aakash aakash wrote: > Thanks Andrew!

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Wenchen Fan
Can we make the JDBCDialect a public API that users can plug in? It looks like an endless job to make sure the Spark JDBC source supports all databases. On Wed, Dec 11, 2019 at 11:41 PM Xiao Li wrote: > You can follow how we test the other JDBC dialects. All JDBC dialects > require the docker integr
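JdbcDialect is already a DeveloperApi that can be registered at runtime; a hedged sketch (the Vertica URL prefix used here is illustrative):

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    object VerticaDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.toLowerCase.startsWith("jdbc:vertica")
      // Override quoteIdentifier, getCatalystType, etc. as needed for the database.
    }

    // Register before reading or writing through the JDBC source.
    JdbcDialects.registerDialect(VerticaDialect)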

Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-10 Thread Wenchen Fan
Sounds good. Thanks for bringing this up! On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro wrote: > That looks nice, thanks! > I checked the previous v2.4.4 release; it has around 130 commits (from > 2.4.3 to 2.4.4), so > I think branch-2.4 already has enough commits for the next release. > > A

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-10 Thread Wenchen Fan
PartitionReader extends Closeable; it seems reasonable to me to do the same for DataWriter. On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim wrote: > Hi devs, > > I'd like to propose to add close() on DataWriter explicitly, which is the > place for resource cleanup. > > The rationalization of the propos
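A hedged sketch of a DS v2 DataWriter with an explicit close(), assuming the proposed method landed in the connector API you build against; the file-based details are made up:

    import java.io.{BufferedWriter, FileWriter}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

    class TextDataWriter(path: String) extends DataWriter[InternalRow] {
      private val out = new BufferedWriter(new FileWriter(path))

      override def write(record: InternalRow): Unit =
        out.write(record.getString(0) + "\n")

      override def commit(): WriterCommitMessage = new WriterCommitMessage {}

      override def abort(): Unit = { /* discard partial output */ }

      // Called after commit()/abort(); the one place to release resources.
      override def close(): Unit = out.close()
    }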

Re: DataSourceWriter V2 Api questions

2019-12-05 Thread Wenchen Fan
process to merge the underlying tables on a periodic basis. >> >> It gets messy and probably moves you towards a write-once only tables, >> etc. >> >> >> >> Finally using views in a generic mongoDB connector may not be good and >> flexible enough. >

Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-04 Thread Wenchen Fan
esolution > proposal > <https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing> > . > > Note that this proposal is a breaking change, but the impact should be > minimal since this applies only when there are temp views and tables with > the same n

Re: Slower than usual on PRs

2019-12-02 Thread Wenchen Fan
Sorry to hear that. Hope you get better soon! On Tue, Dec 3, 2019 at 1:28 AM Holden Karau wrote: > Hi Spark dev folks, > > Just an FYI I'm out dealing with recovering from a motorcycle accident so > my lack of (or slow) responses on PRs/docs is health related and please > don't block on any of m

Re: Fw:Re:Re: A question about radd bytes size

2019-12-02 Thread Wenchen Fan
发件人:"zhangliyun" > 发送日期:2019-12-03 05:56:55 > 收件人:"Wenchen Fan" > 主题:Re:Re: A question about radd bytes size > > Hi Fan: >thanks for reply, I agree that the how the data is stored decides the > total bytes of the table file. > In my experiment, I fou

Re: A question about radd bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored. For example, if we cache the dataframe, then the bytes size is the number of bytes of the binary format of the table cache. If we write to hive tables, then the bytes size is the total size of the data files of the table. On

[DISCUSS] PostgreSQL dialect

2019-11-26 Thread Wenchen Fan
Hi all, Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764 This is going very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate behaviors

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limitation on the number of pre-built distributions? Seems this time we need: 1. Hadoop 2.7 + Hive 1.2, 2. Hadoop 2.7 + Hive 2.3, 3. Hadoop 3 + Hive 2.3. AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination. On Sat, Nov 16,

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
fle hash join? Like code generation for ShuffledHashJoinExec or > something…. > > > > *From: *Wenchen Fan > *Date: *Sunday, November 10, 2019 at 5:57 PM > *To: *"Wang, Gang" > *Cc: *"dev@spark.apache.org" > *Subject: *Re: Why not implement CodegenSupport

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default sort merge join is preferred over shuffle hash join; that's why we haven't spent resources to implement codegen for it. On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang wrote: > There are some cases, shuffle hash join performs even better than sort > merge join. > > While, I noticed that Sh

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documents to define what non-deterministic means. AFAIK, non-deterministic expressions may produce a different result for the same input row, if the already processed input rows are different. The optimizer tries its best to not change the input sequence of non-deterministic ex

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within Spark. On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > Currently, when a PySpark Row is created with keyword arguments, the > fields are sorted alphabetically. This has created a lot of confusion with > users because it

Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-10-31 Thread Wenchen Fan
The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is more stable and we should make releases using 2.7 by default. +1 On Fri, Nov 1, 2019 at 7:16 AM Xiao Li wrote: > Spark 3.0 will still use the Hadoop 2.7 profile by default, I think. > Hadoop 2.7 profile is much more stable

Re: Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to implement left/right join" this should be "left/right non-equal join" On Thu, Oct 24, 2019 at 6:32 AM zhangliyun wrote: > > Hi Herman: >I guess what you mentioned before > ``` > if you are OK with slightly different N
