Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Wenchen Fan
r case it >>>> seems related to your signature. >>>> >>>> failureMessage: No public key: Key with id: () was not able to be >>>> located on http://gpg-keyserver.de/. Upload your public key and try >>>> the operation again.

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Wenchen Fan
expressions/functions can be expensive and I do think Spark should trust the data source and not re-apply pushed filters. If the data source lies, many things can go wrong... On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote: > Well even if it has to apply it again, if pushdown is activated then it >

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
I agree that we should not rewrite existing parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard. 1. It should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
or not missing columns are > OK and let the Datasource deal with the missing data based on it's > underlying storage. > > On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan wrote: > >> I agree that we should not rewrite existing parquet files when a new >> column is added, but we should a

Re: Noisy spark-website notifications

2018-12-19 Thread Wenchen Fan
+1, at least it should only send one email when a PR is merged. On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Can we somehow disable these new email alerts coming through for the Spark > website repo? > > On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: >

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
syntax. Isn't the right thing (for > option 1) to pass the default through to the underlying data source? > Sources that don't support defaults would throw an exception. > > On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan wrote: > >> The standard ADD COLUMN SQL syntax is: ALTER

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage-codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote: > This thread is to discuss adding in support

Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread Wenchen Fan
+1, all the known blockers are resolved. Thanks for driving this! On Wed, Mar 27, 2019 at 11:31 AM DB Tsai wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.1. > > The vote is open until March 30 PST and passes if a majority +1 PMC votes > are cast, with >

Re: Cross Join

2019-03-22 Thread Wenchen Fan
Spark 2.0 is EOL. Can you try 2.3 or 2.4? On Thu, Mar 21, 2019 at 10:23 AM asma zgolli wrote: > Hello , > > I need to cross my data and i'm executing a cross join on two dataframes . > > C = A.crossJoin(B) > A has 50 records > B has 5 records > > the result im getting with spark 2.0 is a
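For reference, a minimal sketch of the same cross join on a supported version; the counts mirror the numbers in the question, and the column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val a = (1 to 50).toDF("id")  // 50 records
val b = (1 to 5).toDF("tag")  // 5 records

// A Cartesian product of 50 x 5 rows; on Spark 2.3/2.4 this should
// yield exactly 250 rows.
val c = a.crossJoin(b)
assert(c.count() == 250)
```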

Re: DataSourceV2 exceptions

2019-04-08 Thread Wenchen Fan
Like `RDD.map`, you can throw whatever exceptions and they will be propagated to the driver side and fail the Spark job. On Mon, Apr 8, 2019 at 3:10 PM Andrew Melo wrote: > Hello, > > I'm developing a (java) DataSourceV2 to read a columnar fileformat > popular in a number of physical sciences
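A minimal sketch of that propagation behavior, using a plain RDD since the reply draws the analogy to `RDD.map`:

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

try {
  // Any exception thrown inside a task is captured, serialized back to
  // the driver, and surfaces there, failing the job.
  spark.sparkContext.parallelize(1 to 10).map { i =>
    if (i == 7) throw new java.io.IOException(s"bad record $i")
    i
  }.count()
} catch {
  case e: SparkException => println(s"Job failed as expected: ${e.getMessage}")
}
```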

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Wenchen Fan
+1 On Tue, Feb 19, 2019 at 10:50 AM Ryan Blue wrote: > Hi everyone, > > It looks like there is consensus on the proposal, so I'd like to start a > vote thread on the SPIP for identifiers in multi-catalog Spark. > > The doc is available here: >

Re: [DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Wenchen Fan
I think this is the right direction to go. Shall we move forward with a vote and detailed designs? On Mon, Feb 4, 2019 at 9:57 AM Ryan Blue wrote: > Hi everyone, > > This is a follow-up to the "Identifiers with multi-catalog support" > discussion thread. I've taken the proposal I posted to that

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Wenchen Fan
+1, thanks for making it clear that this SPIP focuses on high-level direction! On Sat, Mar 2, 2019 at 9:35 AM Reynold Xin wrote: > Thanks Ryan. +1. > > > > > On Fri, Mar 01, 2019 at 5:33 PM, Ryan Blue wrote: > >> Actually, I went ahead and removed the confusing section. There is no >> public

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Wenchen Fan
+1 On Sat, Mar 2, 2019 at 6:11 AM Yinan Li wrote: > +1 > > On Fri, Mar 1, 2019 at 12:37 PM Tom Graves > wrote: > >> +1 for the SPIP. >> >> Tom >> >> On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang < >> jiangxb1...@gmail.com> wrote: >> >> >> Hi all, >> >> I want to call for a vote of

Re: Moving forward with the timestamp proposal

2019-02-20 Thread Wenchen Fan
I think this is the right direction to go, but I'm wondering how Spark can support these new types if the underlying data sources (like Parquet files) do not support them yet. I took a quick look at the new doc for file formats, but I'm not sure what the proposal is. Are we going to implement these new

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Wenchen Fan
it in 3.0. >> > >> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote: >> > Will that then require an API break down the line? Do we save that for >> Spark 4? >> > >> > >> > >> > >> > -Matt Cheah? >> > >> >

Re: [build system] jenkins wedged again, rebooting master node

2019-03-15 Thread Wenchen Fan
cool, thanks! On Sat, Mar 16, 2019 at 1:08 AM shane knapp wrote: > well, that box rebooted in record time! we're back up and building. > > and as always, i'll keep a close eye on things today... jenkins usually > works great, until it doesn't. :\ > > On Fri, Mar 15, 2019 at 9:52 AM shane

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Wenchen Fan
Which version of Parquet has this bug? Maybe we can downgrade it. On Mon, Mar 11, 2019 at 10:34 AM Mark Hamstra wrote: > It worked in 2.3. We broke it with 2.4.0 and were informed of that > regression late in the 2.4.0 release process. Since we didn't fix it before > the 2.4.0 release, it

Re: spark sql occur error

2019-03-22 Thread Wenchen Fan
Did you include the whole error message? On Fri, Mar 22, 2019 at 12:45 AM 563280193 <563280...@qq.com> wrote: > Hi , > I ran a spark sql like this: > > *select imei,tag, product_id,* > * sum(case when succ1>=1 then 1 else 0 end) as succ,* > * sum(case when fail1>=1 and succ1=0 then 1

Re: Manually reading parquet files.

2019-03-22 Thread Wenchen Fan
Try `val encoder = RowEncoder(df.schema).resolveAndBind()`? On Thu, Mar 21, 2019 at 5:39 PM Long, Andrew wrote: > Thanks a ton for the help! > > > > Is there a standardized way of converting the internal row to rows? > > > > I’ve tried this but im getting an exception > > > > val enconder =
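A sketch of the full round trip, assuming the Spark 2.4-era API; note that `RowEncoder` is an internal catalyst class, and Spark 3 later replaced `fromRow` with `createDeserializer`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(3).toDF("id")

// resolveAndBind() binds the encoder's expressions to the schema so it
// can deserialize InternalRow values.
val encoder = RowEncoder(df.schema).resolveAndBind()

// queryExecution.toRdd exposes the InternalRow representation;
// fromRow converts each one back to an external Row (2.4 API).
val rows = df.queryExecution.toRdd.map(encoder.fromRow).collect()
```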

Re: Compatibility on build-in DateTime functions with Hive/Presto

2019-02-17 Thread Wenchen Fan
In the Spark SQL example, `year("1912")` means: first cast "1912" to the date type, and then call the "year" function. In the Postgres example, `date_part('year', TIMESTAMP '2017')` means: get a timestamp literal, and call the "date_part" function. Can you try a date literal in Postgres? On Mon, Feb

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Wenchen Fan
great job! On Mon, Feb 18, 2019 at 4:24 PM Hyukjin Kwon wrote: > Yay! Good job Takeshi! > > On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro >> We are happy to announce the availability of Spark 2.3.3! >> >> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 >> maintenance branch

Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread Wenchen Fan
+1 for 2.4.1 On Tue, Feb 12, 2019 at 7:55 PM Hyukjin Kwon wrote: > +1 for 2.4.1 > > On Tue, Feb 12, 2019 at 4:56 PM, Dongjin Lee wrote: > >> > SPARK-23539 is a non-trivial improvement, so probably would not be >> back-ported to 2.4.x. >> >> Got it. It seems reasonable. >> >> Committers: >> >> Please

Re: Time to cut an Apache 2.4.1 release?

2019-02-14 Thread Wenchen Fan
Do you know which bug ORC 1.5.2 introduced? Or is it because Hive uses a legacy version of ORC which has a bug? On Thu, Feb 14, 2019 at 2:35 PM Darcy Shen wrote: > > We found that ORC table created by Spark 2.4 failed to be read by Hive > 2.1.1. > > > spark-sql -e 'CREATE TABLE tmp.orcTable2

Re: Tungsten Memory Consumer

2019-02-11 Thread Wenchen Fan
what do you mean by "Tungsten Consumer"? On Fri, Feb 8, 2019 at 6:11 PM Jack Kolokasis wrote: > Hello all, > I am studying about Tungsten Project and I am wondering when Spark > creates a Tungsten consumer. While I am running some applications, I see > that Spark creates Tungsten Consumer

Re: [build system] Jenkins stopped working

2019-02-19 Thread Wenchen Fan
Thanks Shane! On Wed, Feb 20, 2019 at 6:48 AM shane knapp wrote: > alright, i increased the httpd and proxy timeouts and kicked apache. i'll > keep an eye on things, but as of right now we're happily building. > > On Tue, Feb 19, 2019 at 2:25 PM shane knapp wrote: > >> aand i had to issue

Re: Spark 2.4.2

2019-04-17 Thread Wenchen Fan
I volunteer to be the release manager for 2.4.2, as I was also going to propose 2.4.2 because of the reverting of SPARK-25250. Are there any other ongoing bug fixes we want to include in 2.4.2? If not, I'd like to start the release process today (CST). Thanks, Wenchen On Thu, Apr 18, 2019 at 3:44

Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why re-creating the dataframe doesn't work: Spark finds out the logical plan is cached and picks the cached data. You need to uncache the dataframe, or go back to the SQL way: df.createTempView("abc") spark.table("abc").cache()
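A sketch of both options; the parquet path is illustrative, and the view name follows the reply:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.read.parquet("/tmp/events")  // illustrative source
df.cache()

// Option 1: drop the cache entry keyed on this logical plan, then
// rebuild the dataframe to observe live data.
df.unpersist()
val fresh = spark.read.parquet("/tmp/events")

// Option 2 (the SQL way from the reply): cache through a named view,
// which can be uncached by name later.
df.createOrReplaceTempView("abc")
spark.table("abc").cache()
spark.sql("UNCACHE TABLE abc")
```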

Re: RDD object Out of scope.

2019-05-21 Thread Wenchen Fan
RDD is kind of a pointer to the actual data. Unless it's cached, we don't need to clean up the RDD. On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris wrote: > HI Spark developers, > > > > Can someone point out the code where RDD objects go out of scope ?. I > found the contextcleaner >

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-29 Thread Wenchen Fan
-port and re-roll the > RC. That said I think we did / had to already drop the ability to build <= > 2.3 from the master release script already. > > On Sun, Apr 28, 2019 at 9:25 PM Wenchen Fan wrote: > >> > ... by using the release script of Spark 2.4 branch >> &

[VOTE] Release Apache Spark 2.4.2

2019-04-18 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.2. The vote is open until April 23 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.2 [ ] -1 Do not release this package because ... To

Re: Spark 2.4.2

2019-04-18 Thread Wenchen Fan
endencies but > bringing them into the process breaks users code easily) > > > -- > *From:* Michael Heuer > *Sent:* Thursday, April 18, 2019 6:41 AM > *To:* Reynold Xin > *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen > Fan;

Re: [VOTE] Release Apache Spark 2.4.3

2019-05-06 Thread Wenchen Fan
+1. The Scala version problem has been resolved, which is the main motivation of 2.4.3. On Mon, May 6, 2019 at 12:38 AM Felix Cheung wrote: > I ran basic tests on R, r-hub etc. LGTM. > > +1 (limited - I didn’t get to run other usual tests) > > -- > *From:* Sean Owen

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-21 Thread Wenchen Fan
this one for sure as it's easy to > overlook with all the pages being updated per release. > > On Thu, Apr 18, 2019 at 9:51 PM Wenchen Fan wrote: > > > > Please vote on releasing the following candidate as Apache Spark version > 2.4.2. > > > > The vote is open until

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
ml) > > On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan wrote: > >> How did you read/write the timestamp value from/to ORC file? >> >> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia < >> shubh.chaura...@gmail.com> wrote: >> >>> Hi Al

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to ORC file? On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia wrote: > Hi All, > > Consider the following(spark v2.4.0): > > Basically I change values of `spark.sql.session.timeZone` and perform an > orc write. Here are 3 samples:- > > 1) >

Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-02 Thread Wenchen Fan
+1 as well On Tue, Jul 2, 2019 at 12:13 PM Dongjoon Hyun wrote: > Thank you so much for the replies, Reynold, Sean, Takeshi, Hyukjin! > > Bests, > Dongjoon. > > On Mon, Jul 1, 2019 at 6:00 PM Hyukjin Kwon wrote: > >> +1 >> >> On Tue, Jul 2, 2019 at 9:39 AM, Takeshi Yamamuro wrote: >> >>> I'm also

Re: API for SparkContext ?

2019-06-30 Thread Wenchen Fan
You can call `SparkContext#addSparkListener` with a listener that implements `onApplicationEnd`. On Tue, May 14, 2019 at 1:51 AM Nasrulla Khan Haris wrote: > HI All, > > > > Is there a API for sparkContext where we can add our custom code before > stopping sparkcontext ? > > Appreciate your
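A minimal sketch of such a listener using the public scheduler API:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sparkContext.addSparkListener(new SparkListener {
  // Invoked when the application ends, i.e. around SparkContext.stop().
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    println(s"Application ended at ${end.time}; running custom cleanup")
  }
})
```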

Re: displaying "Test build" in PR

2019-08-13 Thread Wenchen Fan
"Can one of the admins verify this patch?" is a corrected message, as Jenkins won't test your PR until an admin approves it. BTW I think "5 minutes" is a reasonable delay for PR testing. It usually takes days to review and merge a PR, so I don't think seeing test progress right after PR creation

Re: [SPARK-23207] Repro

2019-08-12 Thread Wenchen Fan
Hi Tyson, Thanks for reporting it! I quickly checked the related scheduler code but can't find an obvious place that can go wrong with cached RDD. Sean said that he can't reproduce it, but the second job fails. This is actually expected. We need a lot more changes to completely fix this problem,

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Wenchen Fan
+1 On Wed, Aug 14, 2019 at 12:52 PM Holden Karau wrote: > +1 > Does anyone have any critical fixes they’d like to see in 2.4.4? > > On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > >> Seems fine to me if there are enough valuable fixes to justify another >> release. If there are any other

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-30 Thread Wenchen Fan
nal > effect of the null decision would be different depending on the insertion > target if the target has different behaviors for null. > > On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan wrote: > >> > I'm a big -1 on null values for invalid casts. >> >> This is why

Re: [build system] colo maintenance & outage tomorrow, 10am-2pm PDT

2019-08-15 Thread Wenchen Fan
Thanks for tracking it Shane! On Fri, Aug 16, 2019 at 7:41 AM Shane Knapp wrote: > just got an update: > > there was a problem w/the replacement part, and they're trying to fix it. > if that's successful, they expect to have power restored within the hour. > > if that doesn't work, a new (new)

Re: [VOTE] Release Apache Spark 2.4.4 (RC1)

2019-08-19 Thread Wenchen Fan
Unfortunately, I need to -1. Recently we found that the repartition correctness bug can still be reproduced. The root cause has been identified and there are 2 PRs to fix 2 related issues: https://github.com/apache/spark/pull/25491 https://github.com/apache/spark/pull/25498 I think we should

Re: Release Spark 2.3.4

2019-08-18 Thread Wenchen Fan
+1 On Sat, Aug 17, 2019 at 3:37 PM Hyukjin Kwon wrote: > +1 too > > On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote: > >> +1 >> >> Regards, >> Dilip Biswal >> Tel: 408-463-4980 >> dbis...@us.ibm.com >> >> >> >> - Original message - >> From: John Zhuge >> To: Xiao Li >> Cc: Takeshi

Re: Apache Spark git repo moved to gitbox.apache.org

2019-08-26 Thread Wenchen Fan
yea I think we should, but no need to worry too much about it because gitbox still works in the release scripts. On Tue, Aug 27, 2019 at 3:23 AM Shane Knapp wrote: > revisiting this old thread... > > i noticed from the committers' page on the spark site that the 'apache' > remote should be

Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-27 Thread Wenchen Fan
+1 On Wed, Aug 28, 2019 at 2:43 AM DB Tsai wrote: > +1 > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 42E5B25A8F7A82C1 > > On Tue, Aug 27, 2019 at 11:31 AM Dongjoon Hyun > wrote: > > > > +1. > > > > I also

Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Wenchen Fan
+1, no more blocking issues that I'm aware of. On Wed, Aug 28, 2019 at 8:33 PM Sean Owen wrote: > +1 from me again. > > On Tue, Aug 27, 2019 at 6:06 PM Dongjoon Hyun > wrote: > > > > Please vote on releasing the following candidate as Apache Spark version > 2.4.4. > > > > The vote is open

Re: JDK11 Support in Apache Spark

2019-08-25 Thread Wenchen Fan
Great work! On Sun, Aug 25, 2019 at 6:03 AM Xiao Li wrote: > Thank you for your contributions! This is a great feature for Spark > 3.0! We finally achieve it! > > Xiao > > On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung > wrote: > >> That’s great! >> >> -- >> *From:*

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Wenchen Fan
Great! Thanks! On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 2.4.4! > > Spark 2.4.4 is a maintenance release containing stability fixes. This > release is based on the branch-2.4 maintenance branch of Spark. We strongly > recommend all

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Wenchen Fan
5:28 PM > *To:* Alastair Green > *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang > *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in > table insertion by default > > > We discussed this thread quite a bit in the DSv2 sync up and Russell >

Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Wenchen Fan
Hi Nicholas, You are talking about a different thing. The PERMISSIVE mode is the failure mode for reading text-based data sources (JSON, CSV, etc.). It's not the general failure mode for Spark table insertion. I agree with you that the PERMISSIVE mode is hard to use. Feel free to open a JIRA
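For context, a sketch of the PERMISSIVE read mode being discussed; the path and schema are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// PERMISSIVE keeps malformed records instead of failing the read:
// fields become null and the raw text lands in the designated column.
val schema = new StructType()
  .add("id", LongType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/input.json")  // illustrative path
```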

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Wenchen Fan
Congratulations! On Tue, Sep 10, 2019 at 10:19 AM Yuanjian Li wrote: > Congratulations! > > On Tue, Sep 10, 2019 at 10:15 AM, sujith chacko wrote: >> Congratulations all. >> >> On Tue, 10 Sep 2019 at 7:27 AM, Haibo wrote: >>> congratulations~ >>> >>> >>> >>> On Sep 10, 2019 at 09:30, Joseph Torres >>> wrote: >>>

Re: [DISCUSS][SPIP][SPARK-29031] Materialized columns

2019-09-15 Thread Wenchen Fan
> 1. It is a waste of IO. The whole column (in Map format) should be read and Spark extract the required keys from the map, even though the query requires only one or a few keys in the map This sounds like a similar use case to nested column pruning. We should push down the map key extracting to

Re: Thoughts on Spark 3 release, or a preview release

2019-09-15 Thread Wenchen Fan
I don't expect to see a large DS V2 API change from now on. But we may update the API a little bit if we find problems during the preview. On Sat, Sep 14, 2019 at 10:16 PM Sean Owen wrote: > I don't think this suggests anything is finalized, including APIs. I > would not guess there will be

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Wenchen Fan
+1 To be honest I don't like the legacy policy. It's too loose and easy for users to make mistakes, especially when Spark returns null if a function hit errors like overflow. The strict policy is not good either. It's too strict and stops valid use cases like writing timestamp values to a date
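A sketch of how the policies differ in practice, assuming the `spark.sql.storeAssignmentPolicy` config that SPARK-28730 added for Spark 3.0:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Policy names from Spark 3.0 (SPARK-28730): LEGACY, ANSI, STRICT.
// LEGACY casts anything and silently writes null on failure; ANSI rejects
// unreasonable conversions such as string-to-int at analysis time, while
// still allowing e.g. writing timestamp values to a date column.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES ('not a number')")  // fails under ANSI
```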

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Wenchen Fan
, because > different users have different levels of tolerance for different kinds of > errors. I’d expect these sorts of configurations to be set up at an > infrastructure level, e.g. to maintain consistent standards throughout a > whole organization. > > > > *From: *Gengli

Re: Re: How to force sorted merge join to broadcast join

2019-07-29 Thread Wenchen Fan
You can try EXPLAIN COST query and see if it works for you. On Mon, Jul 29, 2019 at 5:34 PM Rubén Berenguel wrote: > I think there is no way of doing that (at least don't remember one right > now). The closer I remember now, is you can run the SQL "ANALYZE TABLE > table_name COMPUTE STATISTIC"
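A sketch of both statements mentioned in this thread; the table and query are illustrative, and as the reply itself hedges, check that your Spark version accepts EXPLAIN COST:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Compute table-level statistics first, so the optimizer has row counts
// and sizes to decide on a broadcast join (and EXPLAIN COST can show them).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("EXPLAIN COST SELECT * FROM sales WHERE id > 10").show(false)
```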

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-29 Thread Wenchen Fan
behavior >>> anyway. >>> Eventually, most sources are supposed to be migrated to DataSourceV2 V2. >>> I think we can discuss and make a decision now. >>> >>> > Fixing the silent corruption by adding a runtime exception is not a >>> good opti

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Wenchen Fan
mals and fail simple insert > statements. We already came up with two alternatives to fix that problem in > the DSv2 sync and I think it is a better idea to go with one of those > instead of “fixing” Spark in a way that will corrupt data or cause runtime > failures. > > On T

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Wenchen Fan
defaults, given that there are a lot > of varying opinions on this thread. > > On Mon, Aug 5, 2019 at 12:49 AM Wenchen Fan wrote: > >> I think we need to clarify one thing before further discussion: the >> proposal is for the next release but not a long term solutio

Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Wenchen Fan
I agree with the temp table approach. One idea is: maybe we only need one temp table, and each task writes to this temp table. At the end we read the data from the temp table and write it to the target table. AFAIK JDBC can handle concurrent table writing very well, and it's better than creating
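A sketch of the commit step only, in plain JDBC, assuming every task has already appended its rows to a single temp table; the table names and connection URL are illustrative:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
try {
  conn.setAutoCommit(false)
  val stmt = conn.createStatement()
  // Publishing the job's output is a single transaction, so readers never
  // observe a partial write; an abort would simply drop the temp table.
  stmt.executeUpdate("INSERT INTO target_table SELECT * FROM temp_table")
  stmt.executeUpdate("DROP TABLE temp_table")
  conn.commit()
} catch {
  case e: Exception =>
    conn.rollback()
    throw e
} finally {
  conn.close()
}
```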

Re: DataSourceV2 sync notes - 10 July 2019

2019-07-23 Thread Wenchen Fan
Hi Ryan, Thanks for summarizing and sending out the meeting notes! Unfortunately, I missed the last sync, but the topics are really interesting, especially the stats integration. The ideal solution I can think of is to refactor the optimizer/planner and move all the stats-based optimization to

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-25 Thread Wenchen Fan
I have heard many complaints about the old table insertion behavior. Blindly casting everything will leak the user's mistake to a late stage of the data pipeline and make it very hard to debug. When a user writes string values to an int column, it's probably a mistake and the columns are

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Wenchen Fan
> New pushdown API for DataSourceV2 One correction: I want to revisit the pushdown API to make sure it works for dynamic partition pruning and can be extended to support limit/aggregate/... pushdown in the future. It should be a small API update instead of a new API. On Fri, Sep 20, 2019 at 3:46

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Wenchen Fan
dynamic partition pruning rule generates "hidden" filters that will be converted to real predicates at runtime, so it doesn't matter where we run the rule. For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better to run it before join reorder. On Sun, Sep 29, 2019 at 5:51 AM

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
shuffle hash join? Like code generation for ShuffledHashJoinExec or > something…. > > > > *From: *Wenchen Fan > *Date: *Sunday, November 10, 2019 at 5:57 PM > *To: *"Wang, Gang" > *Cc: *"dev@spark.apache.org" > *Subject: *Re: Why not implement CodegenSupport

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within Spark. On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > Currently, when a PySpark Row is created with keyword arguments, the > fields are sorted alphabetically. This has created a lot of confusion with > users because it

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documents to define what non-deterministic means. AFAIK, non-deterministic expressions may produce a different result for the same input row, if the already processed input rows are different. The optimizer tries its best to not change the input sequence of non-deterministic
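For the expensive-UDF case behind this thread, a sketch using the public `asNondeterministic` API (available since Spark 2.3); the UDF body is a stand-in for costly work:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "bb").toDF("name")

// Marking the UDF nondeterministic stops the optimizer from collapsing,
// reordering, or re-executing it freely: the usual workaround to keep an
// expensive function from being evaluated more than once per row.
val expensive = udf((s: String) => { Thread.sleep(10); s.length }).asNondeterministic()

val out = df.withColumn("len", expensive($"name"))
```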

Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Wenchen Fan
The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is more stable and we should make releases using 2.7 by default. +1 On Fri, Nov 1, 2019 at 7:16 AM Xiao Li wrote: > Spark 3.0 will still use the Hadoop 2.7 profile by default, I think. > Hadoop 2.7 profile is much more

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limitation on the number of pre-built distributions? Seems this time we need 1. hadoop 2.7 + hive 1.2 2. hadoop 2.7 + hive 2.3 3. hadoop 3 + hive 2.3 AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination. On Sat, Nov

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default sort merge join is preferred over shuffle hash join; that's why we haven't spent resources to implement codegen for it. On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang wrote: > There are some cases, shuffle hash join performs even better than sort > merge join. > > While, I noticed that

Re: Fw:Re:Re: A question about RDD bytes size

2019-12-02 Thread Wenchen Fan
发件人:"zhangliyun" > 发送日期:2019-12-03 05:56:55 > 收件人:"Wenchen Fan" > 主题:Re:Re: A question about radd bytes size > > Hi Fan: >thanks for reply, I agree that the how the data is stored decides the > total bytes of the table file. > In my experiment, I fou

Re: Slower than usual on PRs

2019-12-02 Thread Wenchen Fan
Sorry to hear that. Hope you get better soon! On Tue, Dec 3, 2019 at 1:28 AM Holden Karau wrote: > Hi Spark dev folks, > > Just an FYI I'm out dealing with recovering from a motorcycle accident so > my lack of (or slow) responses on PRs/docs is health related and please > don't block on any of

Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-04 Thread Wenchen Fan
> proposal > <https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing> > . > > Note that this proposal is a breaking change, but the impact should be > minimal since this applies only when there are temp views and tables with > the same name. >

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-10 Thread Wenchen Fan
PartitionReader extends Closeable, so it seems reasonable to me to do the same for DataWriter. On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim wrote: > Hi devs, > > I'd like to propose to add close() on DataWriter explicitly, which is the > place for resource cleanup. > > The rationalization of the
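A sketch of what the proposal buys an implementation; the class below is illustrative and not the actual DataWriter interface:

```scala
import java.io.{BufferedWriter, FileWriter}

// With close() on the interface, commit/abort stay about the write
// protocol, and resource cleanup lives in exactly one place that the
// framework can call on every code path.
class FileDataWriter(path: String) extends AutoCloseable {
  private val out = new BufferedWriter(new FileWriter(path))

  def write(record: String): Unit = out.write(record + "\n")
  def commit(): Unit = out.flush()          // make the output durable
  def abort(): Unit = ()                    // nothing durable yet to undo
  override def close(): Unit = out.close()  // cleanup, success or failure
}
```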

Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-10 Thread Wenchen Fan
Sounds good. Thanks for bringing this up! On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro wrote: > That looks nice, thanks! > I checked the previous v2.4.4 release; it has around 130 commits (from > 2.4.3 to 2.4.4), so > I think branch-2.4 already has enough commits for the next release. > > A

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Wenchen Fan
Can we make the JDBCDialect a public API that users can plug in? It looks like an endless job to make sure the Spark JDBC source supports all databases. On Wed, Dec 11, 2019 at 11:41 PM Xiao Li wrote: > You can follow how we test the other JDBC dialects. All JDBC dialects > require the docker
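For context, a sketch of such a plugin using the existing developer API `JdbcDialects.registerDialect`; the Vertica specifics are illustrative:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// A user-provided dialect: Spark picks it based on the JDBC URL prefix.
object VerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")
  override def quoteIdentifier(colName: String): String = s""""$colName""""
}

// Registered once per JVM, before running JDBC reads/writes.
JdbcDialects.registerDialect(VerticaDialect)
```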

Re: DataSourceWriter V2 Api questions

2019-12-05 Thread Wenchen Fan
ng tables on a periodic basis. >> >> It gets messy and probably moves you towards a write-once only tables, >> etc. >> >> >> >> Finally using views in a generic mongoDB connector may not be good and >> flexible enough. >> >> &

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Wenchen Fan
Does anybody remember what we did for 2.0 preview? Personally I'd like to avoid cutting branch-3.0 right now, otherwise we need to merge PRs into two branches in the following several months. Thanks, Wenchen On Wed, Oct 16, 2019 at 3:01 PM Xingbo Jiang wrote: > Hi Dongjoon, > > I'm not sure

Re: Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to implement left/right join" this should be "left/right non-equal join" On Thu, Oct 24, 2019 at 6:32 AM zhangliyun wrote: > > Hi Herman: >I guess what you mentioned before > ``` > if you are OK with slightly different

Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but
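A small sketch that triggers this plan; the data is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val a = Seq((1, 10), (2, 20)).toDF("id", "x")
val b = Seq((1, 15), (2, 25)).toDF("id", "y")

// A non-equality predicate leaves nothing to hash or sort on, so the
// planner falls back to BroadcastNestedLoopJoin for this outer join.
val joined = a.join(b, a("x") < b("y"), "left")
joined.explain()  // look for BroadcastNestedLoopJoin in the physical plan
```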

Re: Apache Spark 3.0 timeline

2019-10-16 Thread Wenchen Fan
> I figure we are probably moving to code freeze late in the year, release early next year? Sounds good! On Thu, Oct 17, 2019 at 7:51 AM Dongjoon Hyun wrote: > Thanks! That sounds reasonable. I'm +1. :) > > Historically, 2.0-preview was on May 2016 and 2.0 was on July, 2016. 3.0 > seems to be

Re: DataSourceV2 sync notes - 2 October 2019

2019-10-18 Thread Wenchen Fan
Ryan Blue wrote: > Here are my notes from last week's DSv2 sync. > > *Attendees*: > > Ryan Blue > Terry Kim > Wenchen Fan > > *Topics*: > >- SchemaPruning only supports Parquet and ORC? >- Out of order optimizer rules >- 3.0 work >

[DISCUSS] PostgreSQL dialect

2019-11-26 Thread Wenchen Fan
Hi all, Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764 This is going very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate

Re: A question about RDD bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored. For example, if we cache the dataframe, then the bytes size is the number of bytes of the binary format of the table cache. If we write to hive tables, then the bytes size is the total size of the data files of the table.

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-07 Thread Wenchen Fan
AFAIK there is no public streaming data source API before DS v2. The Source and Sink API is private and is only for builtin streaming sources. Advanced users can still implement custom stream sources with private Spark APIs (you can put your classes under the org.apache.spark.sql package to access

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Wenchen Fan
+1 I think this is the most reasonable default behavior among the three. On Mon, Oct 7, 2019 at 6:06 PM Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > +1 (non-binding) > > I have been following this standardization effort and I think it is sound > and it provides the needed

Re: [build system] IMPORTANT! northern california fire danger, potential power outage(s)

2019-10-09 Thread Wenchen Fan
Thanks for the updates! On Thu, Oct 10, 2019 at 5:34 AM Shane Knapp wrote: > quick update: > > campus is losing power @ 8pm. this is after we were told 4am, 8am, > noon, and 2-4pm. :) > > PG&E expects to start bringing Alameda County back online at noon > tomorrow, but i believe that target to

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Wenchen Fan
Regarding DS v2, I'd like to remove SPARK-26785 (data source v2 API refactor: streaming write) and SPARK-26956 (remove streaming output mode from data source v2 APIs), and put the umbrella ticket

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-08 Thread Wenchen Fan
d to apply "package hack" but also need to > depend on catalyst. > > > On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan wrote: > >> AFAIK there is no public streaming data source API before DS v2. The >> Source and Sink API is private and is only for builtin streaming sourc

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-10-14 Thread Wenchen Fan
I'm fine with the view definition proposed here, but my major concern is how to make sure tables and views share the same namespace. According to the SQL spec, if there is a view named "a", we can't create a table named "a" anymore. We can add documentation and ask the implementation to guarantee it, but

Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Wenchen Fan
+1, all tests pass On Thu, Dec 19, 2019 at 7:18 AM Takeshi Yamamuro wrote: > Thanks, Yuming! > > I checked the links and the prepared binaries. > Also, I run tests with -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver > -Pmesos -Pkubernetes -Psparkr > on java version "1.8.0_181. > All the things

Re: how to get partition column info in Data Source V2 writer

2019-12-18 Thread Wenchen Fan
Hi Aakash, You can try the latest DS v2 with the 3.0 preview, and the API is in a quite stable shape now. With the latest API, a Writer is created from a Table, and the Table has the partitioning information. Thanks, Wenchen On Wed, Dec 18, 2019 at 3:22 AM aakash aakash wrote: > Thanks
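A sketch of that part of the 3.0 connector API; the class name and partition column are illustrative:

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.connector.expressions.{Expressions, Transform}
import org.apache.spark.sql.types.StructType

class MyTable(tableSchema: StructType) extends Table {
  override def name(): String = "my_table"
  override def schema(): StructType = tableSchema

  // The write path is created from this Table, so a writer can pick the
  // partition columns up from here instead of a separate channel.
  override def partitioning(): Array[Transform] =
    Array(Expressions.identity("date"))

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE)
}
```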

Re: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Wenchen Fan
Thanks for providing the benchmark numbers! The result is very promising and I'm looking forward to seeing more feedback from real-world workloads. On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A wrote: > Hi all, > > We have completed the Spark 3.0 Adaptive Query Execution(AQE) performance > tests in

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Wenchen Fan
Great job, Dongjoon! On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon wrote: > Thanks Dongjoon! > > On Sun, Feb 9, 2020 at 10:49 AM, Takeshi Yamamuro wrote: > >> Happy to hear the release news! >> >> Bests, >> Takeshi >> >> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun >> wrote: >> >>> There was a typo

Re: Request to document the direct relationship between other configurations

2020-02-12 Thread Wenchen Fan
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to document the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.

[DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
Hi all, I'd like to discuss the naming policy of Spark configs, as for now it depends on personal preference, which leads to inconsistent naming. In general, the config name should be a noun that describes its meaning clearly. Good examples: spark.sql.session.timeZone

Re: Datasource V2 support in Spark 3.x

2020-03-05 Thread Wenchen Fan
Data Source V2 has evolved to Connector API which supports both data (the data source API) and metadata (the catalog API). The new APIs are under package org.apache.spark.sql.connector You can keep using Data Source V1 as there is no plan to deprecate it in the near future. But if you'd like to

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
t; Also, the followup JIRA requested seems still open >>>> https://issues.apache.org/jira/browse/SPARK-27386 >>>> I heard this was already discussed but I cannot find the summary of the >>>> meeting or any documentation. >>>> >>>>
