Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Xiao Li
+1 on Jungtaek's point. Can we revisit this when we release Spark 3.1? After the release of 3.0, I believe we will get more feedback about DSv2 from the community. The current design was made by just a small group of contributors. DSv2 + catalog APIs are still evolving. It is very likely we will

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Jungtaek Lim
Just 2 cents. I haven't tracked the changes to DSv2 (though I needed to deal with them, as the change made confusion on my PRs...), but my bet is that DSv2 has already changed in incompatible ways, at least for those who work on a custom DataSource. Making downstream projects diverge their implementation

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Jungtaek Lim
small correction: confusion -> conflict, so I had to go through and understand parts of the changes On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim wrote: > Just 2 cents, I haven't tracked the change of DSv2 (though I needed to > deal with this as the change made confusion on my PRs...), but my

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Dongjoon Hyun
Do you mean you want to have a breaking API change between 3.0 and 3.1? I believe we follow Semantic Versioning ( https://spark.apache.org/versioning-policy.html ). > We just won’t add any breaking changes before 3.1. Bests, Dongjoon. On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue wrote: > I
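Dongjoon's point rests on the semantic versioning policy linked above: breaking API changes are only allowed when the major version increases. As a minimal illustration (not Spark code, just the versioning rule expressed as a function):

```python
def allows_breaking_change(from_version: str, to_version: str) -> bool:
    """Under semantic versioning, breaking API changes are only
    permitted when the major version number increases."""
    from_major = int(from_version.split(".")[0])
    to_major = int(to_version.split(".")[0])
    return to_major > from_major

# A 3.0 -> 3.1 minor release must stay API-compatible:
print(allows_breaking_change("3.0", "3.1"))  # False
# A 2.4 -> 3.0 major release may break APIs:
print(allows_breaking_change("2.4", "3.0"))  # True
```

This is why a breaking DSv2 change between 3.0 and 3.1 would conflict with the stated policy.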

Spark Tasks Progress

2019-09-20 Thread Sultan Alamro
Hi all, I am trying to perform some actions on the driver side in Spark while an application is running. The driver needs to know the tasks' progress before making any decision. I know that task progress can be accessed within each executor or task from the RecordReader class by calling getProgress().
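On the driver side, per-stage task counts can be polled via PySpark's `sc.statusTracker().getStageInfo(stage_id)`, which exposes `numTasks` and `numCompletedTasks`. The aggregation step is sketched below as plain Python (the `(completed, total)` pairs are assumed to come from such a status tracker; this is not a complete Spark program):

```python
from typing import List, Tuple

def overall_progress(stages: List[Tuple[int, int]]) -> float:
    """Aggregate (completed_tasks, total_tasks) pairs from all active
    stages into a single progress fraction for the driver to act on."""
    total = sum(n for _, n in stages)
    done = sum(c for c, _ in stages)
    return done / total if total else 0.0

# e.g. two active stages: 30/100 and 50/50 tasks done -> 80/150
print(round(overall_progress([(30, 100), (50, 50)]), 3))  # 0.533
```

In a real application this would run on a background thread in the driver, polling the status tracker between scheduling decisions.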

JDK11 QA (SPARK-29194)

2019-09-20 Thread Dongjoon Hyun
Hi, All. As a next step, we started JDK11 QA. https://issues.apache.org/jira/browse/SPARK-29194 This issue mainly focuses on the following areas, but feel free to add any sub-issues which you hit on JDK11 from now. - Documentations - Examples - Performance - Integration Tests

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow. Sounds like we agree, then. We will use it for 3.0, but there are known problems with it. Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise. To point out some problems

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> When you created the PR to make InternalRow public. This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2 and while both you and I did some work to avoid this, we

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Sean Owen
I don't know enough about DSv2 to comment on this part, but, any theoretical 2.5 is still a ways off. Does waiting for 3.0 to 'stabilize' it as much as is possible help? I say that because re: Java 11, the main breaking change is probably the Hive 2 / Hadoop 3 dependency, JPMML (minor), as well

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
I didn't realize that Java 11 would require breaking changes. What breaking changes are required? On Fri, Sep 20, 2019 at 11:18 AM Sean Owen wrote: > Narrowly on Java 11: the problem is that it'll take some breaking > changes, more than would be usually appropriate in a minor release, I >

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> DSv2 is far from stable right? No, I think it is reasonably stable and very close to being ready for a release. > All the actual data types are unstable and you guys have completely ignored that. I think what you're referring to is the use of `InternalRow`. That's a stable API and there has

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Sean Owen
Narrowly on Java 11: the problem is that it'll take some breaking changes, more than would be usually appropriate in a minor release, I think. I'm still not convinced there is a burning need to use Java 11 but stay on 2.4, after 3.0 is out, and at least the wheels are in motion there. Java 8 is

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Ryan Blue
I’m not sure that DSv2 list is accurate. We discussed this in the DSv2 sync this week (just sent out the notes) and came up with these items: - Finish TableProvider update to avoid another API change: pass all table config from metastore - Catalog behavior fix:

[DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
Hi everyone, In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added. A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released because they will be able to use a

DSv2 sync notes - 28 September 2019

2019-09-20 Thread Ryan Blue
Here are my notes from this week’s DSv2 sync. *Attendees*: Ryan Blue Holden Karau Russell Spitzer Terry Kim Wenchen Fan Shiv Prashant Sood Joseph Torres Gengliang Wang Matt Cheah Burak Yavuz *Topics*: - Driver-side Hadoop conf - SHOW DATABASES/NAMESPACES behavior - Review outstanding

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Dongjoon Hyun
Thank you for the summarization, Xingbo. I also agree with Sean because I don't think those block 3.0.0 preview release. Especially, correctness issues should not be there. Instead, could you summarize what we have as of now for 3.0.0 preview? I believe JDK11 (SPARK-28684) and Hive 2.3.5

custom FileStreamSource which reads from one partition onwards

2019-09-20 Thread Georg Heiler
Hi, to the best of my knowledge, the existing FileStreamSource reads all the files in a directory (Hive table). However, I need to be able to specify an initial partition it should start from (i.e. like a Kafka offset/initial warmed-up state) and then only read data which is semantically (i.e. using a
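The core of such a source is the partition-filtering step: given Hive-style partition directories, keep only files at or after a starting partition value, analogous to a Kafka offset. A minimal sketch of that filter in plain Python (the path layout, `date` partition key, and function name are illustrative assumptions, not an existing Spark API):

```python
def files_from_partition(paths, start, key="date"):
    """Keep only Hive-style partitioned paths whose partition value
    is >= start (lexicographic, so use sortable values like ISO dates)."""
    prefix = key + "="
    def partition_value(path):
        for part in path.split("/"):
            if part.startswith(prefix):
                return part[len(prefix):]
        return None  # unpartitioned paths are skipped

    return [p for p in paths
            if (v := partition_value(p)) is not None and v >= start]

paths = [
    "tbl/date=2019-09-01/part-0.parquet",
    "tbl/date=2019-09-19/part-0.parquet",
    "tbl/date=2019-09-20/part-0.parquet",
]
print(files_from_partition(paths, "2019-09-19"))
# ['tbl/date=2019-09-19/part-0.parquet', 'tbl/date=2019-09-20/part-0.parquet']
```

A custom source would apply this filter when listing files for the initial batch, then track the highest partition seen as its offset.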

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Sean Owen
Is this a list of items that might be focused on for the final 3.0 release? At least, Scala 2.13 support shouldn't be on that list. The others look plausible, or are already done, but there are probably more. As for the 3.0 preview, I wouldn't necessarily block on any particular feature, though,

Re: Parquet read performance for different schemas

2019-09-20 Thread Tomas Bartalos
I forgot to mention an important part: I'm issuing the same query to both parquet files - selecting only one column: df.select(sum('amount)) BR, Tomas št 19. 9. 2019 o 18:10 Tomas Bartalos napísal(a): > Hello, > > I have 2 parquets (each containing 1 file): > >- parquet-wide - schema has 25 top
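The comparison hinges on column pruning: Parquet is columnar, so selecting a single column should touch only that column's data regardless of schema width (though footer/metadata overhead still grows with the number of columns). A toy columnar layout illustrating the pruning effect (the class and its instrumentation are invented for illustration, not Parquet internals):

```python
class ToyColumnarFile:
    """Toy columnar layout: each column is stored contiguously,
    so scanning one column never touches the other columns."""
    def __init__(self, columns):
        self.columns = columns        # name -> list of values
        self.values_read = 0          # instrumentation counter

    def scan(self, name):
        col = self.columns[name]
        self.values_read += len(col)  # only this column's values are read
        return col

# A "wide" file: 25 unrelated columns plus the one we query.
wide = ToyColumnarFile({f"c{i}": list(range(1000)) for i in range(25)})
wide.columns["amount"] = [1.0] * 1000

print(sum(wide.scan("amount")))   # 1000.0
print(wide.values_read)           # 1000, not 26000
```

If the wide file were still much slower to query, the difference would likely come from metadata decoding or row-group layout rather than the column data itself.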

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Wenchen Fan
> New pushdown API for DataSourceV2 One correction: I want to revisit the pushdown API to make sure it works for dynamic partition pruning and can be extended to support limit/aggregate/... pushdown in the future. It should be a small API update instead of a new API. On Fri, Sep 20, 2019 at 3:46

Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Xingbo Jiang
Hi all, Let's start a new thread to discuss the on-going features for Spark 3.0 preview release. Below is the feature list for the Spark 3.0 preview release. The list is collected from the previous discussions in the dev list. - Followup of the shuffle+repartition correctness issue: support