Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-08 Thread Wenchen Fan
is specified. This gives us more time to think about how to do it in 3.1. If you have other ideas, please reply to this thread. Thanks, Wenchen On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim wrote: > Thanks, filed SPARK-31257 > <https://issues.apache.org/jira/browse/SPARK-31257>.

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-08 Thread Wenchen Fan
Can you give some simple examples to demonstrate the problem? I think the inconsistency would bring problems but don't know how. On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim wrote: > (bump to expose the discussion to more readers) > > On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim > wrote: > >> Hi

Re: is there any tool to visualize the spark physical plan or spark plan

2020-04-30 Thread Wenchen Fan
Does the Spark SQL web UI work for you? https://spark.apache.org/docs/3.0.0-preview/web-ui.html#sql-tab On Thu, Apr 30, 2020 at 5:30 PM Manu Zhang wrote: > Hi Kelly, > > If you can parse event log, then try listening on > `SparkListenerSQLExecutionStart` event and build a `SparkPlanGraph` like

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Wenchen Fan
IIUC We are moving away from having 2 classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use with a single class. I don't have a strong preference over option 3 or 4. We may need to collect more data points from actual users. On Mon, Apr 27, 2020 at 9:50 PM

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Wenchen Fan
*Regards,Dhrubajyoti Hati.* > > > On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > >> This looks like a bug that path filter doesn't work for hive table >> reading. Can you open a JIRA ticket? >> >> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati >

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Wenchen Fan
This looks like a bug that path filter doesn't work for hive table reading. Can you open a JIRA ticket? On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati wrote: > Just wondering if any one could help me out on this. > > Thank you! > > > > > *Regards,Dhrubajyoti Hati.* > > > On Wed, Apr 22, 2020

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Wenchen Fan
ng) to RC1 Please reply to this thread if you know more critical issues that should be fixed before 3.0. Thanks, Wenchen On Fri, Apr 10, 2020 at 10:01 AM Xiao Li wrote: > Only the low-risk or high-value bug fixes, and the documentation changes > are allowed to merge to branch-3.

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Wenchen Fan
llo > > On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote: > >> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not >> sure this is possible as the DS V2 API is very different in 3.0, e.g. there >> is no `DataSourceV2` anymore, and you should implement `T

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Wenchen Fan
Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible as the DS V2 API is very different in 3.0, e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` (if you don't have database/table). On Wed, Apr 8, 2020 at 6:58 AM Andrew
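For readers hitting this migration, a minimal sketch of the 3.0 entry point mentioned above (method shapes follow the `org.apache.spark.sql.connector` API; `MySource` and its schema are hypothetical):

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Sketch only: in 3.0 there is no DataSourceV2 marker interface anymore;
// a source implements TableProvider and hands Spark a Table.
class MySource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("value", "string")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    ??? // return a Table that implements SupportsRead / SupportsWrite
}
```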

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Wenchen Fan
Yea, release candidates are different from the preview version, as release candidates are not official releases, so they won't appear in Maven Central, can't be downloaded in the Spark official website, etc. On Wed, Apr 1, 2020 at 12:32 PM Sean Owen wrote: > These are release candidates, not

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-29 Thread Wenchen Fan
I agree that we can cut the RC anyway even if there are blockers, to move us to a more official "code freeze" status. About the CREATE TABLE unification, it's still WIP and not close-to-merge yet. Can we fix some specific problems like CREATE EXTERNAL TABLE surgically and leave the unification to

Re: Programmatic: parquet file corruption error

2020-03-27 Thread Wenchen Fan
Running a Spark application from an IDE is not officially supported. It may work in some cases but there is no guarantee at all. The official way is to run interactive queries with spark-shell, or package your application into a jar and use spark-submit. On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Wenchen Fan
g from maven. > > Backbutton.co.uk > ¯\_(ツ)_/¯ > ♡۶Java♡۶RMI ♡۶ > Make Use Method {MUM} > makeuse.org > <http://www.backbutton.co.uk> > > > On Fri, 27 Mar 2020 at 05:45, Wenchen Fan wrote: > >> Which Spark/Scala version do you use? >

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use? On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman wrote: > > with the following sparksession configuration > > val spark = SparkSession.builder().master("local[*]").appName("Spark Session > take").getOrCreate(); > > this line works > > flights.filter(flight_row

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-24 Thread Wenchen Fan
Hi Ryan, It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with. Thanks, Wenchen On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas wrote: > On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > >> 2.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Wenchen Fan
with Hive connected. >>>> >>>> But since we are even thinking about native syntax as a first class and >>>> dropping Hive one implicitly (hide in doc) or explicitly, does it really >>>> matter we require a marker (like "HIVE") in rule

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
documentation to make things clear, but if the approach >> would be explaining the difference of rules and guide the tip to make the >> query be bound to the specific rule, the same could be applied to parser >> rule to address the root cause. >> >> >> On Wed, M

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
ry but I think we are making bad assumption on end users which is a > serious problem. > > If we really want to promote Spark's one for CREATE TABLE, then would it > really matter to treat Hive CREATE TABLE be "exceptional" one and try to > isolate each other? What's the point of

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive because the native syntax requires the USING clause while the Hive syntax makes ROW FORMAT or STORED AS clause optional. It's a good move to make
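The two grammars being unified can be illustrated as follows (a sketch assuming a `spark` session with Hive support; table names are made up). The `USING` clause selects the native rule, while `ROW FORMAT` / `STORED AS` selects the Hive rule:

```scala
// Native Spark syntax: USING names the data source.
spark.sql("CREATE TABLE native_t (id INT, name STRING) USING parquet")

// Hive syntax: STORED AS / ROW FORMAT. Historically this rule was
// mutually exclusive with the one above because USING was required there.
spark.sql("CREATE TABLE hive_t (id INT, name STRING) STORED AS parquet")
```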

Re: Spark 2.4.x and 3.x datasourcev2 api documentation & references

2020-03-18 Thread Wenchen Fan
For now you can take a look at `DataSourceV2Suite`, which contains both Java/Scala implementations. There is also an ongoing PR to implement catalog APIs for JDBC: https://github.com/apache/spark/pull/27345 We are still working on the user guide. On Mon, Mar 16, 2020 at 4:59 AM MadDoxX wrote:

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
. should be updated as well. b) Simply document that, the underlying data source may or may not enforce the length limitation of VARCHAR(x). Please let me know if you have different ideas. Thanks, Wenchen On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust wrote: > What I'd oppose is to just
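Option (b) above can be made concrete with a sketch (hypothetical table; whether the second statement fails is exactly the source-dependent behavior being documented):

```scala
spark.sql("CREATE TABLE chars_t (v VARCHAR(5)) STORED AS parquet")
// The underlying data source may or may not enforce the VARCHAR(5)
// length limitation -- the value below is 8 characters long:
spark.sql("INSERT INTO chars_t VALUES ('abcdefgh')")
```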

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
changing here. Any ideas are welcome! Thanks, Wenchen On Tue, Mar 17, 2020 at 11:29 AM Stephen Coy wrote: > I don’t think I can recall any usages of type CHAR in any situation. > > Really, it’s only use (on any traditional SQL database) would be when you > *want* a fixed width char

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-17 Thread Wenchen Fan
I don't think option 1 is possible. For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns. For option 3: It's a bit risky to add a new API but seems like we have a good reason. The

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Wenchen Fan
+1 (binding), assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc. On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía wrote: > +1 (non-binding) > > Michael's section on the trade-offs of maintaining / removing an API are > one of > the best reads I have

Re: Datasource V2 support in Spark 3.x

2020-03-05 Thread Wenchen Fan
Data Source V2 has evolved to Connector API which supports both data (the data source API) and metadata (the catalog API). The new APIs are under package org.apache.spark.sql.connector You can keep using Data Source V1 as there is no plan to deprecate it in the near future. But if you'd like to

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Wenchen Fan
The JIRA ticket will show the linked PR if there are any, which indicates that someone is working on it if the PR is active. Maybe the bot should also leave a comment on the JIRA ticket to make it clearer? On Fri, Feb 21, 2020 at 10:54 PM younggyu Chun wrote: > Hi All, > > I would like to

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-18 Thread Wenchen Fan
I don't know what's the best way to deprecate an SQL function. Runtime warning can be annoying if it keeps coming out. Maybe we should only log the warning once per Spark application. On Tue, Feb 18, 2020 at 3:45 PM Dongjoon Hyun wrote: > Thank you for feedback, Wenchen, Maxim, and Take

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-15 Thread Wenchen Fan
of "fixing" the parameter order that worth to make a breaking change. Thanks, Wenchen On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun wrote: > Please note that the context if TRIM/LTRIM/RTRIM with two-parameters and > TRIM(trimStr FROM str) syntax. > > This thread is irrelevant
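For context, the two TRIM forms in question look like this (a sketch; which argument the two-parameter form treats as the trim string is exactly the inconsistency under discussion, so do not rely on the order shown):

```scala
// SQL-standard syntax: the trim string is named explicitly.
spark.sql("SELECT trim('x' FROM 'xxhelloxx')").show()

// Two-parameter function form: Spark's argument order differed from
// other databases, which is what triggered this thread.
spark.sql("SELECT trim('xxhelloxx', 'x')").show()
```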

Re: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Wenchen Fan
Thanks for providing the benchmark numbers! The result is very promising and I'm looking forward to seeing more feedback from real-world workloads. On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A wrote: > Hi all, > > We have completed the Spark 3.0 Adaptive Query Execution(AQE) performance > tests in

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
date/timestamp as value AFAIK. Thanks, Wenchen On Thu, Feb 13, 2020 at 11:29 AM Jungtaek Lim wrote: > +1 Thanks for the proposal. Looks very reasonable to me. > > On Thu, Feb 13, 2020 at 10:53 AM Hyukjin Kwon wrote: > >> +1. >> >> 2020년 2월 13일 (목) 오전 9:30, Genglian

[DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
and you can't find a good verb for the feature, featureName.enabled is also good. I'll update https://spark.apache.org/contributing.html after we reach a consensus here. Any comments are welcome! Thanks, Wenchen

Re: Request to document the direct relationship between other configurations

2020-02-12 Thread Wenchen Fan
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to document the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
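A concrete instance of such a coupling, as it might be documented (a `spark-defaults.conf` sketch; without shuffle tracking, dynamic allocation depends on the external shuffle service):

```properties
# Dynamic allocation needs executors' shuffle files to outlive the
# executors, so these two configs are effectively coupled:
spark.shuffle.service.enabled     true
spark.dynamicAllocation.enabled   true
```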

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Wenchen Fan
What's your use case to compare intervals? It's tricky in Spark as there is only one interval type and you can't really compare one month with 30 days. On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack wrote: > Hi Devs, > > I would like to know what is the current roadmap of making >
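The ambiguity can be seen with two interval literals (a sketch): both are valid `CalendarInterval` values, but there is no general answer to which is larger, since a month spans 28 to 31 days:

```scala
val a = spark.sql("SELECT INTERVAL 1 MONTH")
val b = spark.sql("SELECT INTERVAL 30 DAY")
// There is no well-defined ordering between a and b without fixing a
// month length -- which is why intervals are not comparable in Spark.
```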

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Wenchen Fan
Great Job, Dongjoon! On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon wrote: > Thanks Dongjoon! > > 2020년 2월 9일 (일) 오전 10:49, Takeshi Yamamuro 님이 작성: > >> Happy to hear the release news! >> >> Bests, >> Takeshi >> >> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun >> wrote: >> >>> There was a typo

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-02-05 Thread Wenchen Fan
This is a hack really and we don't recommend users to access internal classes directly. That's why there is no public document. If you really need to do it and are aware of the risks, you can read the source code. All expressions (or the so-called "native UDF") extend the base class `Expression`.

Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-03 Thread Wenchen Fan
AFAIK there is no ongoing critical bug fixes, +1 On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun wrote: > Yes, it does officially since 2.4.0. > > 2.4.5 is a maintenance release of 2.4.x line and the community didn't > support Hadoop 3.x on 'branch-2.4'. We didn't run test at all. > > Bests, >

Re: [FYI] `Target Version` on `Improvement`/`New Feature` JIRA issues

2020-02-02 Thread Wenchen Fan
Thanks for cleaning this up! On Sun, Feb 2, 2020 at 2:08 PM Xiao Li wrote: > Thanks! Dongjoon. > > Xiao > > On Sat, Feb 1, 2020 at 5:15 PM Hyukjin Kwon wrote: > >> Thanks Dongjoon. >> >> On Sun, 2 Feb 2020, 09:08 Dongjoon Hyun, wrote: >> >>> Hi, All. >>> >>> From Today, we have `branch-3.0`

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
to call a java function to do the partitioning. This is different from a UDF, as a UDF means someone gives Spark a function and asks Spark to run it. Partitioning is the opposite. Hope this helps. Thanks, Wenchen On Thu, Jan 23, 2020 at 3:42 PM Hyukjin Kwon wrote: > There's another PR open to expose this m

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread Wenchen Fan
The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of

Re: Correctness and data loss issues

2020-01-21 Thread Wenchen Fan
I think we need to go through them during the 3.0 QA period, and try to fix the valid ones. For example, the first ticket should be fixed already in https://issues.apache.org/jira/browse/SPARK-28344 On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun wrote: > Hi, All. > > According to our policy,

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Wenchen Fan
I think there are a few details we need to discuss. how frequently a source should update its metrics? For example, if file source needs to report size metrics per row, it'll be super slow. what metrics a source should report? data size? numFiles? read time? shall we show metrics in SQL web UI

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Wenchen Fan
The proposal makes sense to me. If we are not going to make interval type ANSI-compliant in this release, we should not expose it widely. Thanks for driving it, Kent! On Fri, Jan 17, 2020 at 10:52 AM Dr. Kent Yao wrote: > Following ANSI might be a good option but also a serious user behavior >

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving, so half-baked features are sometimes inevitable. This feature is definitely in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss. About expression duplication. This is an existing design choice. We don't

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-15 Thread Wenchen Fan
Recently we merged several fixes to 2.4: https://issues.apache.org/jira/browse/SPARK-30325 a driver hang issue https://issues.apache.org/jira/browse/SPARK-30246 a memory leak issue https://issues.apache.org/jira/browse/SPARK-29708 a correctness issue(for a rarely used feature, so not merged

Re: Question about Datasource V2

2020-01-13 Thread Wenchen Fan
1. we plan to add view support in future releases. 2. can you open a JIRA ticket? This seems like a bug to me. 3. instead of defining a lot of fields in the table, we decide to use properties to keep all the extra information. We've defined some reserved properties like "comment", "location",

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
> Iacovos > On 1/9/20 5:03 PM, Wenchen Fan wrote: > > RDD has a flag `storageLevel` which will be set by calling persist(). RDD > will be serialized and sent to executors for running tasks. So executors > just look at RDD.storageLevel and store output in its block manager wh

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
RDD has a flag `storageLevel` which will be set by calling persist(). RDD will be serialized and sent to executors for running tasks. So executors just look at RDD.storageLevel and store output in its block manager when needed. On Thu, Jan 9, 2020 at 5:53 PM Jack Kolokasis wrote: > Hello all, >
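A sketch of the mechanism described above (assumes a `spark` session):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 100)
// persist() only sets the storageLevel flag on the driver side:
rdd.persist(StorageLevel.MEMORY_ONLY)
// The RDD (flag included) is serialized into tasks; each executor checks
// rdd.storageLevel and caches computed partitions in its block manager.
rdd.count() // first action actually materializes and caches the data
```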

Re: [SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Wenchen Fan
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`. On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack wrote: > Hi Devs, > > I'd like to propose a stricter version of as[T]. Given the interface def > as[T](): Dataset[T], it is
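A sketch of the behavior described (hypothetical case class; assumes `import spark.implicits._` is in scope):

```scala
import org.apache.spark.sql.functions.lit

case class Rec(id: Long)
val df = spark.range(3).withColumn("extra", lit("x"))

df.as[Rec].printSchema()               // lazy: still shows `id` and `extra`
df.as[Rec].map(identity).printSchema() // forces the encoder: only `id`
```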

Re: [DISCUSS] Support subdirectories when accessing partitioned Parquet Hive table

2020-01-06 Thread Wenchen Fan
Isn't your directory structure malformed? The directory name under the table path should be in the form of "partitionCol=value". And AFAIK this is the Hive standard. On Mon, Jan 6, 2020 at 6:59 PM Lotkowski, Michael wrote: > Hi all, > > > > Reviving this thread, we still have this issue and
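For reference, the standard layout versus the malformed one described above might look like this (hypothetical paths):

```
table_path/
  part_col=2020-01/part-00000.parquet         <- standard Hive layout
  part_col=2020-01/subdir/part-00001.parquet  <- subdirectory: not standard
```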

Re: Release Apache Spark 2.4.5

2020-01-05 Thread Wenchen Fan
+1 On Mon, Jan 6, 2020 at 12:02 PM Jungtaek Lim wrote: > +1 to have another Spark 2.4 release, as Spark 2.4.4 was released in 4 > months old and there's release window for this. > > On Mon, Jan 6, 2020 at 12:38 PM Hyukjin Kwon wrote: > >> Yeah, I think it's nice to have another maintenance

Re: Fw:Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-29 Thread Wenchen Fan
+1 for the new thrift server to get rid of the Hive dependencies! On Mon, Dec 23, 2019 at 7:55 PM Yuming Wang wrote: > I'm +1 for this SPIP for these two reasons: > > 1. The current thriftserver has some issues that are not easy to solve, > such as: SPARK-28636

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Wenchen Fan
Sounds good! On Tue, Dec 24, 2019 at 7:48 AM Reynold Xin wrote: > We've pushed out 3.0 multiple times. The latest release window documented > on the website says we'd > code freeze and cut branch-3.0 early Dec. It looks like we are suffering a >

Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Wenchen Fan
+1, all tests pass On Thu, Dec 19, 2019 at 7:18 AM Takeshi Yamamuro wrote: > Thanks, Yuming! > > I checked the links and the prepared binaries. > Also, I run tests with -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver > -Pmesos -Pkubernetes -Psparkr > on java version "1.8.0_181. > All the things

Re: how to get partition column info in Data Source V2 writer

2019-12-18 Thread Wenchen Fan
Hi Aakash, You can try the latest DS v2 with the 3.0 preview, and the API is in a quite stable shape now. With the latest API, a Writer is created from a Table, and the Table has the partitioning information. Thanks, Wenchen On Wed, Dec 18, 2019 at 3:22 AM aakash aakash wrote: > Thanks And

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Wenchen Fan
Can we make the JDBCDialect a public API that users can plugin? It looks like an end-less job to make sure Spark JDBC source supports all databases. On Wed, Dec 11, 2019 at 11:41 PM Xiao Li wrote: > You can follow how we test the other JDBC dialects. All JDBC dialects > require the docker
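For what it's worth, a dialect can already be registered at runtime through the developer API, which is close to the plugin model suggested here (a sketch; the Vertica specifics are illustrative):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object VerticaDialect extends JdbcDialect {
  // Claim JDBC URLs for this database.
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:vertica")
  // Override quoting (and type mappings, etc.) as needed.
  override def quoteIdentifier(colName: String): String =
    s""""$colName""""
}

JdbcDialects.registerDialect(VerticaDialect)
```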

Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-10 Thread Wenchen Fan
Sounds good. Thanks for bringing this up! On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro wrote: > That looks nice, thanks! > I checked the previous v2.4.4 release; it has around 130 commits (from > 2.4.3 to 2.4.4), so > I think branch-2.4 already has enough commits for the next release. > > A

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-10 Thread Wenchen Fan
PartitionReader extends Closeable, so it seems reasonable to me to do the same for DataWriter. On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim wrote: > Hi devs, > > I'd like to propose to add close() on DataWriter explicitly, which is the > place for resource cleanup. > > The rationalization of the

Re: DataSourceWriter V2 Api questions

2019-12-05 Thread Wenchen Fan
ng tables on a periodic basis. >> >> It gets messy and probably moves you towards a write-once only tables, >> etc. >> >> >> >> Finally using views in a generic mongoDB connector may not be good and >> flexible enough. >> >> &

Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-04 Thread Wenchen Fan
to lookup tables: one for SELECT/INSERT and one for other commands. Thanks, Wenchen On Mon, Dec 2, 2019 at 9:12 AM Terry Kim wrote: > Hi all, > > As discussed in SPARK-29900, Spark currently has two different relation > resolution behaviors: > >1. Look up temp view

Re: Slower than usual on PRs

2019-12-02 Thread Wenchen Fan
Sorry to hear that. Hope you get better soon! On Tue, Dec 3, 2019 at 1:28 AM Holden Karau wrote: > Hi Spark dev folks, > > Just an FYI I'm out dealing with recovering from a motorcycle accident so > my lack of (or slow) responses on PRs/docs is health related and please > don't block on any of

Re: Fw:Re:Re: A question about radd bytes size

2019-12-02 Thread Wenchen Fan
From: "zhangliyun" > Date: 2019-12-03 05:56:55 > To: "Wenchen Fan" > Subject: Re:Re: A question about radd bytes size > > Hi Fan: > thanks for the reply, I agree that how the data is stored decides the > total bytes of the table file. > In my experiment, I fou

Re: A question about radd bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored. For example, if we cache the dataframe, then the bytes size is the number of bytes of the binary format of the table cache. If we write to hive tables, then the bytes size is the total size of the data files of the table.

[DISCUSS] PostgreSQL dialect

2019-11-26 Thread Wenchen Fan
files of PostgreSQL tests. Any comments are welcome! Thanks, Wenchen

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limitation on the number of pre-built distributions? Seems this time we need 1. hadoop 2.7 + hive 1.2 2. hadoop 2.7 + hive 2.3 3. hadoop 3 + hive 2.3 AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't need to add JDK version to the combination. On Sat, Nov

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
shuffle hash join? Like code generation for ShuffledHashJoinExec or > something…. > > > > *From: *Wenchen Fan > *Date: *Sunday, November 10, 2019 at 5:57 PM > *To: *"Wang, Gang" > *Cc: *"dev@spark.apache.org" > *Subject: *Re: Why not implement CodegenSupport

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default sort merge join is preferred over shuffle hash join; that's why we haven't spent resources to implement codegen for it. On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang wrote: > There are some cases, shuffle hash join performs even better than sort > merge join. > > While, I noticed that

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documents to define what non-deterministic means. AFAIK, non-deterministic expressions may produce a different result for the same input row, if the already processed input rows are different. The optimizer tries its best to not change the input sequence of non-deterministic
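The user-facing side of this contract is the `asNondeterministic` marker (a sketch; a DataFrame `df` with an integer column `x` is assumed):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Marking the UDF non-deterministic tells the optimizer it may return a
// different result for the same input row, so it must not be re-ordered,
// duplicated, or pushed through filters freely.
val expensive = udf((x: Int) => { Thread.sleep(10); x * 2 }).asNondeterministic()
val out = df.withColumn("y", expensive(col("x")))
```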

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within Spark. On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > Currently, when a PySpark Row is created with keyword arguments, the > fields are sorted alphabetically. This has created a lot of confusion with > users because it

Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Wenchen Fan
The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is more stable and we should make releases using 2.7 by default. +1 On Fri, Nov 1, 2019 at 7:16 AM Xiao Li wrote: > Spark 3.0 will still use the Hadoop 2.7 profile by default, I think. > Hadoop 2.7 profile is much more

Re: Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to implement left/right join" this should be "left/right non-equal join" On Thu, Oct 24, 2019 at 6:32 AM zhangliyun wrote: > > Hi Herman: >I guess what you mentioned before > ``` > if you are OK with slightly different

Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but
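A sketch of the kind of join that forces this plan (hypothetical `events` and `windows` DataFrames; with no equality predicate there is no key to shuffle or hash on):

```scala
// Non-equi left join: no equi-keys, so Spark falls back to
// BroadcastNestedLoopJoin, broadcasting one side in full.
val joined = events.join(
  windows,
  events("ts") >= windows("start") && events("ts") < windows("end"),
  "left")
joined.explain() // look for BroadcastNestedLoopJoin in the physical plan
```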

Re: DataSourceV2 sync notes - 2 October 2019

2019-10-18 Thread Wenchen Fan
Hi Ryan, Thanks for summarizing and sending out the notes! I've created the JIRA ticket to add v2 statements for all the commands that need to resolve a table: https://issues.apache.org/jira/browse/SPARK-29481 Contributions to it are appreciated! Thanks, Wenchen On Fri, Oct 11, 2019 at 7:05 AM

Re: Apache Spark 3.0 timeline

2019-10-16 Thread Wenchen Fan
; figure we are probably moving to code freeze late in the year, release >> early next year? Any better ideas about estimates to publish? They aren't >> binding. >> >> On Wed, Oct 16, 2019, 4:01 PM Dongjoon Hyun >> wrote: >> >>> Hi, All. >>> >>

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Wenchen Fan
Does anybody remember what we did for 2.0 preview? Personally I'd like to avoid cutting branch-3.0 right now, otherwise we need to merge PRs into two branches in the following several months. Thanks, Wenchen On Wed, Oct 16, 2019 at 3:01 PM Xingbo Jiang wrote: > Hi Dongjoon, > > I'm

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-10-14 Thread Wenchen Fan
I'm fine with the view definition proposed here, but my major concern is how to make sure table/view share the same namespace. According to the SQL spec, if there is a view named "a", we can't create a table named "a" anymore. We can add documents and ask the implementation to guarantee it, but

Re: [build system] IMPORTANT! northern california fire danger, potential power outage(s)

2019-10-09 Thread Wenchen Fan
Thanks for the updates! On Thu, Oct 10, 2019 at 5:34 AM Shane Knapp wrote: > quick update: > > campus is losing power @ 8pm. this is after we were told 4am, 8am, > noon, and 2-4pm. :) > > PG expects to start bringing alameda county back online at noon > tomorrow, but i believe that target to

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-08 Thread Wenchen Fan
d to apply "package hack" but also need to > depend on catalyst. > > > On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan wrote: > >> AFAIK there is no public streaming data source API before DS v2. The >> Source and Sink API is private and is only for builtin streaming sourc

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Wenchen Fan
umbrella ticket instead: SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> data source V2 API refactoring Thanks, Wenchen On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun wrote: > Thank you for the preparation of 3.0-preview, Xingbo! > > Bests, > Dongjoon. > > On Tue,

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Wenchen Fan
+1 I think this is the most reasonable default behavior among the three. On Mon, Oct 7, 2019 at 6:06 PM Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > +1 (non-binding) > > I have been following this standardization effort and I think it is sound > and it provides the needed

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-07 Thread Wenchen Fan
AFAIK there is no public streaming data source API before DS v2. The Source and Sink API is private and is only for builtin streaming sources. Advanced users can still implement custom stream sources with private Spark APIs (you can put your classes under the org.apache.spark.sql package to access

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Wenchen Fan
dynamic partition pruning rule generates "hidden" filters that will be converted to real predicates at runtime, so it doesn't matter where we run the rule. For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better to run it before join reorder. On Sun, Sep 29, 2019 at 5:51 AM

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Wenchen Fan
> New pushdown API for DataSourceV2 One correction: I want to revisit the pushdown API to make sure it works for dynamic partition pruning and can be extended to support limit/aggregate/... pushdown in the future. It should be a small API update instead of a new API. On Fri, Sep 20, 2019 at 3:46

Re: [DISCUSS][SPIP][SPARK-29031] Materialized columns

2019-09-15 Thread Wenchen Fan
source to fix these problems themselves. Thanks, Wenchen On Tue, Sep 10, 2019 at 5:47 PM Jason Guo wrote: > Hi, > > I'd like to propose a feature name materialized column. This feature will > boost queries on complex type columns. > > <http://goog_64495576> &g

Re: Thoughts on Spark 3 release, or a preview release

2019-09-15 Thread Wenchen Fan
I don't expect to see a large DS V2 API change from now on. But we may update the API a little bit if we find problems during the preview. On Sat, Sep 14, 2019 at 10:16 PM Sean Owen wrote: > I don't think this suggests anything is finalized, including APIs. I > would not guess there will be

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Wenchen Fan
5:28 PM > *To:* Alastair Green > *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang > *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in > table insertion by default > > > We discussed this thread quite a bit in the DSv2 sync up and Russell >

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Wenchen Fan
Congratulations! On Tue, Sep 10, 2019 at 10:19 AM Yuanjian Li wrote: > Congratulations! > > sujith chacko 于2019年9月10日周二 上午10:15写道: > >> Congratulations all. >> >> On Tue, 10 Sep 2019 at 7:27 AM, Haibo wrote: >> >>> congratulations~ >>> >>> >>> >>> 在2019年09月10日 09:30,Joseph Torres >>> 写道: >>>

Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Wenchen Fan
ticket if you have some better ideas. Thanks, Wenchen On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas wrote: > A quick question about failure modes, as a casual observer of the DSv2 > effort: > > I was considering filing a JIRA ticket about enhancing the DataFrameReader > to incl

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Wenchen Fan
policy can be a stopper as it's too big a breaking change, which may break many existing queries. Thanks, Wenchen On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang wrote: > Hi everyone, > > I'd like to call for a vote on SPARK-28885 > <https://issues.apache.org/jira/browse/SPARK-28885>

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Wenchen Fan
Great! Thanks! On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 2.4.4! > > Spark 2.4.4 is a maintenance release containing stability fixes. This > release is based on the branch-2.4 maintenance branch of Spark. We strongly > recommend all

Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Wenchen Fan
+1, no more blocking issues that I'm aware of. On Wed, Aug 28, 2019 at 8:33 PM Sean Owen wrote: > +1 from me again. > > On Tue, Aug 27, 2019 at 6:06 PM Dongjoon Hyun > wrote: > > > > Please vote on releasing the following candidate as Apache Spark version > 2.4.4. > > > > The vote is open

Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-27 Thread Wenchen Fan
+1 On Wed, Aug 28, 2019 at 2:43 AM DB Tsai wrote: > +1 > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 42E5B25A8F7A82C1 > > On Tue, Aug 27, 2019 at 11:31 AM Dongjoon Hyun > wrote: > > > > +1. > > > > I also

Re: Apache Spark git repo moved to gitbox.apache.org

2019-08-26 Thread Wenchen Fan
yea I think we should, but no need to worry too much about it because gitbox still works in the release scripts. On Tue, Aug 27, 2019 at 3:23 AM Shane Knapp wrote: > revisiting this old thread... > > i noticed from the committers' page on the spark site that the 'apache' > remote should be

Re: JDK11 Support in Apache Spark

2019-08-25 Thread Wenchen Fan
Great work! On Sun, Aug 25, 2019 at 6:03 AM Xiao Li wrote: > Thank you for your contributions! This is a great feature for Spark > 3.0! We finally achieve it! > > Xiao > > On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung > wrote: > >> That’s great! >> >> -- >> *From:*

Re: [VOTE] Release Apache Spark 2.4.4 (RC1)

2019-08-19 Thread Wenchen Fan
have this fix in 2.3 and 2.4. Thanks, Wenchen On Tue, Aug 20, 2019 at 7:32 AM Dongjoon Hyun wrote: > Thank you for testing, Sean and Herman. > > There are three reporting until now. > > 1. SPARK-28775 is for JDK 8u221+ testing at Apache Spark 3.0/2.4/2.3. > 2. SPARK-28749 is for

Re: Release Spark 2.3.4

2019-08-18 Thread Wenchen Fan
+1 On Sat, Aug 17, 2019 at 3:37 PM Hyukjin Kwon wrote: > +1 too > > On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote: > >> +1 >> >> Regards, >> Dilip Biswal >> Tel: 408-463-4980 >> dbis...@us.ibm.com >> >> >> >> - Original message - >> From: John Zhuge >> To: Xiao Li >> Cc: Takeshi

Re: [build system] colo maintenance & outage tomorrow, 10am-2pm PDT

2019-08-15 Thread Wenchen Fan
Thanks for tracking it Shane! On Fri, Aug 16, 2019 at 7:41 AM Shane Knapp wrote: > just got an update: > > there was a problem w/the replacement part, and they're trying to fix it. > if that's successful, they expect to have power restored within the hour. > > if that doesn't work, a new (new)

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Wenchen Fan
+1 On Wed, Aug 14, 2019 at 12:52 PM Holden Karau wrote: > +1 > Does anyone have any critical fixes they’d like to see in 2.4.4? > > On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > >> Seems fine to me if there are enough valuable fixes to justify another >> release. If there are any other

Re: displaying "Test build" in PR

2019-08-13 Thread Wenchen Fan
"Can one of the admins verify this patch?" is a correct message, as Jenkins won't test your PR until an admin approves it. BTW I think "5 minutes" is a reasonable delay for PR testing. It usually takes days to review and merge a PR, so I don't think seeing test progress right after PR creation

Re: [SPARK-23207] Repro

2019-08-12 Thread Wenchen Fan
, so currently the fix is to fail the job if the scheduler needs to retry an indeterminate shuffle map stage. It would be great to know if we can reproduce this bug with the master branch. Thanks, Wenchen On Sun, Aug 11, 2019 at 7:22 AM Xiao Li wrote: > Hi, Tyson, > > Could you open a

Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Wenchen Fan
I agree with the temp table approach. One idea is: maybe we only need one temp table, and each task writes to this temp table. At the end we read the data from the temp table and write it to the target table. AFAIK JDBC can handle concurrent table writing very well, and it's better than creating
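The single-temp-table commit protocol described above — every task appends to one staging table, and job commit moves the staged rows into the target in one transaction — can be sketched with sqlite3 standing in for a real JDBC source. The table names `staging` and `target` are made up for illustration.

```python
# A minimal sketch of the temp-table commit idea: tasks write to a
# staging table, and the job commit publishes all rows atomically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, v TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER, v TEXT)")

# Each "task" appends its partition of rows to the shared staging table.
for partition in ([(1, "a"), (2, "b")], [(3, "c")]):
    conn.executemany("INSERT INTO staging VALUES (?, ?)", partition)

# Job commit: move staged rows into the target in one transaction, so
# readers never observe a partially written job. If any task failed, we
# would instead just drop the staging table and the target is untouched.
with conn:
    conn.execute("INSERT INTO target SELECT * FROM staging")
    conn.execute("DELETE FROM staging")

print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # → 3
```

This mirrors the trade-off in the thread: one staging table per job keeps cleanup simple, and it leans on the database's own concurrent-insert handling rather than creating a temp table per task.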
