[OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi all, This is the bi-weekly Apache Spark digest from the Databricks OSS team. For each API/configuration/behavior change, an *[API]* tag is added in the title. CORE

Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Holden Karau
Got it, I missed the date in the reading :) On Tue, Jul 21, 2020 at 11:23 AM Xingbo Jiang wrote: > Hi Holden, > > This is the digest for commits merged between *June 3 and June 16.* The > commits you mentioned would be included in the future digests. > > Cheers, > > Xingbo > > On Tue, Jul 21,

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread zero323
Given a discussion related to the SPARK-32320 PR, I'd like to resurrect this thread. Is there any interest in migrating the annotations to the main repository?

Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Holden Karau
I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown & [SPARK-21040][CORE] Speculate tasks which are running on decommission executors, two of the PRs merged after the decommissioning SPIP. On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang wrote: > Hi all, > > This

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Since we've recently dropped support for Python <=3.5, I think it would be nice to add support for type annotations. Having this in the main repository allows us to do type checking using MyPy in the CI itself.

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Holden Karau
Yeah, I think this could be a great project now that we're only Python 3.5+. One possibility is making this an Outreachy project to get more folks from different backgrounds involved in Spark. On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko wrote: > Since we've recently dropped support for

Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi Holden, This is the digest for commits merged between *June 3 and June 16.* The commits you mentioned would be included in the future digests. Cheers, Xingbo On Tue, Jul 21, 2020 at 11:13 AM Holden Karau wrote: > I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are >

[DISCUSS] Amend the committer guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-21 Thread Holden Karau
Hi Spark Developers, There has been a rather active discussion regarding the specific vetoes that occurred during Spark 3. From that, I believe we are now mostly in agreement that it would be best to clarify our rules around code vetoes & merging in general. Personally I believe this change is

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Fully agree Holden, it would be great to include the Outreachy project. Adding annotations is a very friendly way to get familiar with the codebase. I've also created a PR to see what's needed to get mypy in: https://github.com/apache/spark/pull/29180. From there on we can start adding annotations.

Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-21 Thread Holden Karau
Hi Folks, In Spark SQL there is the ability to have Spark do its partition discovery/file listing in parallel on the worker nodes and also avoid locality lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit more complicated to do right. I made a quick POC and two
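
To illustrate the pattern under discussion, here is a rough sketch (not the POC mentioned above; the helper name parallelListFiles is made up) of fanning per-directory listing calls out to the executors instead of walking the directories serially on the driver:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Sketch: list each directory on an executor and collect the file paths back
// to the driver. A real implementation would also ship the driver's Hadoop
// configuration to the executors, as Spark SQL's InMemoryFileIndex does.
def parallelListFiles(spark: SparkSession, dirs: Seq[String]): Seq[String] = {
  val numSlices = math.max(1, math.min(dirs.size, 100))
  spark.sparkContext
    .parallelize(dirs, numSlices)
    .flatMap { dir =>
      val path = new Path(dir)
      // Executor-side Hadoop FileSystem call; the default Configuration here is
      // a simplification for the sketch.
      val fs = path.getFileSystem(new Configuration())
      fs.listStatus(path).filter(_.isFile).map(_.getPath.toString)
    }
    .collect()
    .toSeq
}
```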

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Hyukjin Kwon
Yeah, I tend to be positive about leveraging the Python type hints in general. However, just to clarify, I don't think we should just port the type hints into the main code yet; maybe we should think about having/porting Maciej's work, the pyi stub files. For now, I tend to think adding type hints to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-07-21 Thread Steve Loughran
On Sun, 12 Jul 2020 at 01:45, gpongracz wrote: > As someone who mainly operates in AWS it would be very welcome to have the > option to use an updated version of Hadoop with pyspark sourced from PyPI. > > Acknowledging the issues of backwards compatibility... > > The most vexing issue is the

Re: java.lang.ClassNotFoundException for s3a committer

2020-07-21 Thread Steve Loughran
On Tue, 7 Jul 2020 at 03:42, Stephen Coy wrote: > Hi Steve, > > While I understand your point regarding the mixing of Hadoop jars, this > does not address the java.lang.ClassNotFoundException. > > Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or > Hadoop 3.2, not Hadoop 3.1.
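
For readers hitting the same exception, a minimal sketch of the cloud-committer wiring usually involved, assuming the spark-hadoop-cloud module (which provides PathOutputCommitProtocol) and a matching hadoop-aws jar are on the classpath; if they are not, the classes named below are typically the ones that surface in the ClassNotFoundException:

```scala
import org.apache.spark.sql.SparkSession

// Configure the S3A "directory" staging committer for DataFrame writes.
// The two committer classes live in Spark's optional hadoop-cloud module.
val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```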

Re: Catalog API for Partition

2020-07-21 Thread JackyLee
The `partitioning` in `TableCatalog.createTable` is a partition schema for the table, which doesn't contain the partition metadata for an actual partition. Besides, the actual partition metadata may contain many partition schemas, such as Hive partitions. Thus I created a `TablePartition` to contain
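
A hypothetical sketch of the shape being proposed; `TablePartition` and its fields below are illustrative only, not an existing Spark API:

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical: one concrete partition of a table, identified by the values of
// the table's partition columns and carrying its own metadata -- distinct from
// the Transform-based partitioning schema passed to TableCatalog.createTable.
case class TablePartition(
    ident: InternalRow,                // e.g. the values for (year, month, day)
    properties: JMap[String, String])  // partition-level metadata: location, stats, ...
```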

Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
And to be clear: yes, execution plans show exactly what it's doing. The problem is that it's unclear how that relates to the actual Scala/Python code. On 7/21/20 15:45, Michal Sankot wrote: Yes, the problem is that DAGs only refer to the code line (action) that invoked it. It doesn't provide

Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
Yes, the problem is that DAGs only refer to the code line (action) that invoked them. They don't provide information about how individual transformations link to the code. So you can have dozens of stages, each with the same code line that invoked it, doing different stuff. And then we guess what

Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
Hi, when I analyze and debug our Spark batch job executions it's a pain to find out how blocks in the Spark UI Jobs/SQL tabs correspond to the actual Scala code that we write and how much time they take. Would there be a way to somehow instruct the compiler or something and get this information into
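
Not the compiler-level support asked about here, but one existing knob that helps tie UI entries back to code is SparkContext.setJobDescription, which labels the jobs triggered by a block of code. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ui-labels").master("local[*]").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// The description set here is shown next to the jobs it triggers in the UI
// Jobs tab, so each action below can be traced back to its code block.
sc.setJobDescription("load and dedupe events")
val events = spark.range(0, 1000000).distinct()
events.count()

sc.setJobDescription("daily aggregation")
events.groupBy(($"id" % 7).as("bucket")).count().show()
```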

Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Russell Spitzer
Have you looked at the DAG visualization? Each block refers to the code line invoking it. For DataFrames the execution plan will let you know explicitly which operations are in which stages. On Tue, Jul 21, 2020, 8:18 AM Michal Sankot wrote: > Hi, > when I analyze and debug our Spark batch jobs

InterpretedUnsafeProjection - error in getElementSize

2020-07-21 Thread Janda Martin
Hi, I think I found an error in InterpretedUnsafeProjection::getElementSize. This method differs from the similar implementation in GenerateUnsafeProjection. InterpretedUnsafeProjection::getElementSize returns the wrong size for UDTs. I suggest using similar code from GenerateUnsafeProjection.
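
A hedged sketch of the suggested direction (not an actual patch): unwrap a UserDefinedType to its underlying sqlType before computing the array element size, mirroring what GenerateUnsafeProjection effectively does, instead of falling through to defaultSize:

```scala
import org.apache.spark.sql.types._

// Sketch of the suggested behaviour: size UDTs by their underlying SQL type
// when deciding whether an array element is stored inline or by offset (8 bytes).
def getElementSize(dataType: DataType): Int = dataType match {
  case udt: UserDefinedType[_] => getElementSize(udt.sqlType)
  case NullType | StringType | BinaryType | CalendarIntervalType |
       _: DecimalType | _: StructType | _: ArrayType | _: MapType => 8
  case _ => dataType.defaultSize
}
```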