Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
, -Matt Cheah

Spark shuffle consolidateFiles performance degradation quantification

2014-11-03 Thread Matt Cheah
, -Matt Cheah

[Classloading] Strange class loading issue

2014-11-05 Thread Matt Cheah
I've made as well. Thanks, -Matt Cheah

JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-17 Thread Matt Cheah
aggregateByKey is also affected. Thanks, -Matt Cheah

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Matt Cheah
it and need to apply the change to aggregate()? It seems appropriate to target a fix for 1.3.0. -Matt Cheah From: Josh Rosen rosenvi...@gmail.com Date: Wednesday, February 18, 2015 at 6:12 AM To: Matt Cheah mch...@palantir.com Cc: dev@spark.apache.org dev@spark.apache.org, Mingyu Kim m

[Performance] Possible regression in rdd.take()?

2015-02-18 Thread Matt Cheah
regression. I was wondering if anyone else observed this regression, and if so, if anyone would have any idea what could possibly have caused it between Spark 1.0.2 and Spark 1.1.1? Thanks, -Matt Cheah

Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Matt Cheah
I actually tested Spark 1.2.0 with the code in the rdd.take() method swapped out for what was in Spark 1.0.2. The run time was still slower, which indicates to me something at work lower in the stack. -Matt Cheah On 2/18/15, 4:54 PM, Patrick Wendell pwend...@gmail.com wrote: I believe

Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Matt Cheah
for the quick and accurate response! -Matt Cheah From: Aaron Davidson ilike...@gmail.com Date: Wednesday, February 18, 2015 at 5:25 PM To: Matt Cheah mch...@palantir.com Cc: Patrick Wendell pwend...@gmail.com, dev@spark.apache.org dev@spark.apache.org, Mingyu Kim m...@palantir.com, Sandor Van

Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Matt Cheah
Excellent! Where can I find the code, pull request, and Spark ticket where this was introduced? Thanks, -Matt Cheah From: Reynold Xin r...@databricks.com Date: Monday, June 1, 2015 at 10:25 PM To: Matt Cheah mch...@palantir.com Cc: dev@spark.apache.org dev@spark.apache.org, Mingyu Kim m

[SQL] Write parquet files under partition directories?

2015-06-01 Thread Matt Cheah
output format, but it looks like ParquetTableOperations.scala has fixed the output format to AppendingParquetOutputFormat. Also, I was wondering if it would be valuable to contribute writing Parquet in partition directories as a PR. Thanks, -Matt Cheah

PySpark GroupByKey implementation question

2015-07-14 Thread Matt Cheah
with map-side-combine set to false. Is it something specific to how Pyspark can potentially spill the individual groups to disk? Thanks, -Matt Cheah P.S. Relevant Links: https://issues.apache.org/jira/browse/SPARK-3074 https://github.com/apache/spark/pull/1977

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
I was executing on Spark 1.4 so I didn't notice the Tungsten option would make spilling happen in 1.5. I'll upgrade to 1.5 and see how that turns out. Thanks! From: Reynold Xin <r...@databricks.com> Date: Monday, September 21, 2015 at 5:36 PM To: Matt Cheah <mch...@palantir.com>

DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
hought on this problem is. Did we consciously think about the robustness implications when choosing to use an in memory Hash Map to compute the aggregation? Is this an inherent limitation of the aggregation implementation in Data Frames? Thanks, -Matt Cheah

Quick question regarding Maven and Spark Assembly jar

2015-12-03 Thread Matt Cheah
Hi everyone, A very brief question out of curiosity - is there any particular reason why we don't publish the Spark assembly jar on the Maven repository? Thanks, -Matt Cheah

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Matt Cheah
-1 because of SPARK-16181 which is a correctness regression from 1.6. Looks like the patch is ready though: https://github.com/apache/spark/pull/13884 – it would be ideal for this patch to make it into the release. -Matt Cheah From: Nick Pentreath <nick.pentre...@gmail.com<mailto:nick.

Re: Preserving partitioning with dataframe select

2016-02-08 Thread Matt Cheah
be appreciated! -Matt Cheah From: Reynold Xin <r...@databricks.com> Date: Sunday, February 7, 2016 at 11:11 PM To: Matt Cheah <mch...@palantir.com> Cc: "dev@spark.apache.org" <dev@spark.apache.org>, Mingyu Kim <m...@palantir.com> Subject: Re: Preserving

Preserving partitioning with dataframe select

2016-02-05 Thread Matt Cheah
ing the amount of data that is shuffled? 2) If the planner takes advantage of co-partitioning, is the renaming of the columns invalidating the partitioning of the grouped Data Frame? When I look at the planner's conversion from logical.Project to the physical plan, I only see it invoking child.mapPartitions without specifying the preservesPartitioning flag. Thanks, -Matt Cheah

HashedRelation Memory Pressure on Broadcast Joins

2016-03-01 Thread Matt Cheah
iver’s memory, but my driver ran out of memory after I increased the autoBroadcastJoinThreshold. YourKit is indicating that this logic is consuming more memory than my driver can handle. Thanks, -Matt Cheah

Understanding fault tolerance in shuffle operations

2016-03-10 Thread Matt Cheah
shuffle data? This is the scenario I’m running into most, where my tasks fail because they try to reach the shuffle service instead of trying to recompute the lost shuffle files. Thanks, -Matt Cheah

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Matt Cheah
that it is difficult to reason about dataset size on disk vs. memory. -Matt Cheah On 3/2/16, 10:15 AM, "Davies Liu" <dav...@databricks.com> wrote: >UnsafeHashedRelation and HashedRelation could also be used in Executor >(for non-broadcast hash join), then the UnsafeRow could come from >

Re: spark 2.0 issue with yarn?

2016-05-09 Thread Matt Cheah
at all. However jersey-client looks relatively harmless since it does not bundle in JAX-RS classes, nor does it appear to have anything weird in its META-INF folder. -Matt Cheah On 5/9/16, 3:10 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote: >Hi Jesse, > >On Mo

Proposal for SPARK-18278

2016-11-29 Thread Matt Cheah
nagers. A first draft of a proposal outlining a potential long-term plan around this feature has been attached to the JIRA ticket. Any feedback and discussion would be greatly appreciated. Thanks, -Matt Cheah

Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Matt Cheah
submission client, so that would need to be pluggable as well. More discussion on fully pluggable scheduler backends is at https://issues.apache.org/jira/browse/SPARK-19700. -Matt Cheah From: Erik Erlandson <eerla...@redhat.com> Date: Friday, August 18, 2017 at 8:34 AM To

Re: Kubernetes backend and docker images

2018-01-08 Thread Matt Cheah
. From: Anirudh Ramanathan <ramanath...@google.com.INVALID> Date: Monday, January 8, 2018 at 9:48 AM To: Felix Cheung <felixcheun...@hotmail.com> Cc: 蒋星博 <jiangxb1...@gmail.com>, Marcelo Vanzin <van...@cloudera.com>, dev <dev@spark.apache.org>, Matt Cheah <mch...@p

Re: Kubernetes backend and docker images

2018-01-08 Thread Matt Cheah
Think we can allow for different images and default to them being the same. Apologize if I missed that as being the original intention though. -Matt Cheah On 1/8/18, 1:45 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote: On Mon, Jan 8, 2018 at 1:39 PM, Matt Cheah <

Re: Kubernetes backend and docker images

2018-01-08 Thread Matt Cheah
// Fixing Anirudh's email address From: Matt Cheah Sent: Monday, January 8, 2018 1:39:12 PM To: Anirudh Ramanathan; Felix Cheung Cc: 蒋星博; Marcelo Vanzin; dev; Timothy Chen Subject: Re: Kubernetes backend and docker images We would still want images to be able

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
anzin" <van...@cloudera.com> wrote: On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah <mch...@palantir.com> wrote: > If we use spark-submit in client mode from the driver container, how do we handle needing to switch between a cluster-mode scheduler backend and a client-mode

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
circumstances, versus client mode being allowed with a specific flag. If we’re saying that we don’t support client mode, we should bias towards making client mode as difficult as possible to access, i.e. impossible with a standard Spark distribution. -Matt Cheah On 1/10/18, 1:24 PM, "Marcelo V

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
don’t need to use spark-submit at all, meaning that the differences can more or less be ignored at least in this particular context. -Matt Cheah On 1/10/18, 8:40 AM, "Marcelo Vanzin" <van...@cloudera.com> wrote: On a side note, while it's great that you guys have meeting

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
definitely a niche use case – I’m not sure how often pod presets are used in practice - but it’s an example to illustrate why the separation of concerns can be beneficial. -Matt Cheah On 1/10/18, 2:36 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote: On Wed, Jan 10, 2018 at 2

Re: Kubernetes: why use init containers?

2018-01-09 Thread Matt Cheah
. From: Yinan Li <liyinan...@gmail.com> Date: Tuesday, January 9, 2018 at 7:16 PM To: Nicholas Chammas <nicholas.cham...@gmail.com> Cc: Anirudh Ramanathan <ramanath...@google.com.invalid>, Marcelo Vanzin <van...@cloudera.com>, Matt Cheah <mch...@palantir.com>, Kimo

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Matt Cheah
with the Spark community via the Spark mailing list and Spark JIRA tickets. We’re specifically aiming to deprecate the fork and migrate all the work done on the fork into the main line. -Matt Cheah From: Mark Hamstra <m...@clearstorydata.com> Date: Monday, February 5, 2018 at 1:44 PM To: Matt

Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Matt Cheah
/1XPLh3E2JJ7yeJSDLZWXh_lUcjZ1P0dy9QeUEyxIlfak/edit# I hope that we can have a productive discussion and continue improving the Kubernetes integration further. Thanks, -Matt Cheah

[Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Matt Cheah
a lot that needs to be discussed for this improvement, so we hope to get as much input as possible before moving forward with a design. Please feel free to leave comments and suggestions on the JIRA ticket or on the discussion document. Thank you! -Matt Cheah

Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-09-04 Thread Matt Cheah
from more Spark users. The experience would be greatly appreciated in the discussion. -Matt Cheah From: Yuanjian Li Date: Friday, August 31, 2018 at 8:29 PM To: Matt Cheah Cc: Spark dev list Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle

Re: [Kubernetes] Resource requests and limits for Driver and Executor Pods

2018-03-30 Thread Matt Cheah
The question is more so generally what an advised best practice is for setting CPU limits. It’s not immediately clear what a correct value is for setting CPU limits if one wants to provide guarantees for consistent / guaranteed execution performance while also not degrading performance.

Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Matt Cheah
Re: Hadoop versioning – it seems reasonable enough for us to be publishing an image per Hadoop version. We should essentially have image configuration parity with what we publish as distributions on the Spark website. Sometimes jars need to be swapped out entirely instead of being strictly

Re: time for Apache Spark 3.0?

2018-11-12 Thread Matt Cheah
of all proposed breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can be filtered down to somehow? Thanks, -Matt Cheah From: Vinoo Ganesh Date: Monday, November 12, 2018 at 2:48 PM To: Reynold Xin Cc: Xiao Li , Matei Zaharia , Ryan Blue , Mark Hamstra , dev

Re: time for Apache Spark 3.0?

2018-11-13 Thread Matt Cheah
I just added the label to https://issues.apache.org/jira/browse/SPARK-25908. Unsure if there are any others. I’ll look through the tickets and see if there are any that are missing the label. -Matt Cheah From: Sean Owen Date: Tuesday, November 13, 2018 at 12:09 PM To: Matt Cheah Cc

Re: time for Apache Spark 3.0?

2018-11-13 Thread Matt Cheah
ease-notes' with a description of the change. The release itself has a migration guide that's being updated as we go. On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah wrote: > > I wanted to clarify what categories of APIs are eligible to be broken in Spark 3

Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Matt Cheah
Relying on kubectl exec may not be the best solution because clusters with locked down security will not grant users permissions to execute arbitrary code in pods. I can’t think of a great alternative right now but I wanted to bring this to our attention for the time being. -Matt Cheah

Re: [DISCUSS] Function plugins

2018-12-14 Thread Matt Cheah
than Hive UDFs. -Matt Cheah From: Reynold Xin Date: Friday, December 14, 2018 at 1:49 PM To: "rb...@netflix.com" Cc: Spark Dev List Subject: Re: [DISCUSS] Function plugins Having a way to register UDFs that are not using Hive APIs would be great! On Fri, Dec 14,

SPARK-25299: Updates As Of December 19, 2018

2018-12-19 Thread Matt Cheah
further discussion in this space. You may comment in this e-mail thread or by commenting on the progress report document. Looking forward to hearing from you. Thanks, -Matt Cheah

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Matt Cheah
+1 for n-part namespace as proposed. Agree that a short SPIP would be appropriate for this. Perhaps also a JIRA ticket? -Matt Cheah From: Felix Cheung Date: Sunday, January 20, 2019 at 4:48 PM To: "rb...@netflix.com" , Spark Dev List Subject: Re: [DISCUSS] Identifiers

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Matt Cheah
+1 (non-binding) Are identifiers and namespaces going to be rolled under one of those six points? From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Thursday, February 28, 2019 at 8:39 AM To: Spark Dev List Subject: [VOTE] Functional DataSourceV2 in Spark 3.0 I’d like to call a

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread Matt Cheah
+1 (non-binding) From: Jamison Bennett Date: Thursday, February 28, 2019 at 8:28 AM To: Ryan Blue , Spark Dev List Subject: Re: [VOTE] SPIP: Spark API for Table Metadata +1 (non-binding) Jamison Bennett Cloudera Software Engineer jamison.benn...@cloudera.com 515 Congress Ave, Suite

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Matt Cheah
), what appeal is there for users to upgrade to that latest version? -Matt Cheah On 2/28/19, 1:37 PM, "Mridul Muralidharan" wrote: I am -1 on this vote for pretty much all the reasons that Mark mentioned. A major version change gives us an opportunity to remove

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
is going to take to implement and review. -Matt Cheah On 2/24/19, 3:05 PM, "Sean Owen" wrote: Sure, I don't read anyone making these statements though? Let's assume good intent, that "foo should happen" as "my opinion as a member of the community, whi

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Reynold made a note earlier about a proper Row API that isn’t InternalRow – is that still on the table? -Matt Cheah From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Tuesday, February 26, 2019 at 4:40 PM To: Matt Cheah Cc: Sean Owen , Wenchen Fan , Xiao Li , Matei Zahar

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Will that then require an API break down the line? Do we save that for Spark 4? -Matt Cheah From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Tuesday, February 26, 2019 at 4:53 PM To: Matt Cheah Cc: Sean Owen , Wenchen Fan , Xiao Li , Matei Zaharia , Spark Dev List S

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matt Cheah
point release. But #1 and #2 are also the features that have remained open for the longest time and we really need to move forward on these. Putting a target release for 3.0 will help in that regard. -Matt Cheah From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Thursday, F

Re: building docker images for GPU

2019-02-11 Thread Matt Cheah
, and then provide detailed instructions for how to build custom Docker images (mostly just needing to make sure the custom image has the right entry point). -Matt Cheah From: Rong Ou Date: Friday, February 8, 2019 at 2:28 PM To: "dev@spark.apache.org" Subject: building docker images for

Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-05 Thread Matt Cheah
this to a voting phase and to begin proposing our work against upstream Spark? Thanks, -Matt Cheah From: "Yifei Huang (PD)" Date: Monday, May 13, 2019 at 1:04 PM To: Mridul Muralidharan Cc: Bo Yang , Ilan Filonenko , Imran Rashid , Justin Uang , Liang Tang , Marcelo Vanzin , Mat

Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-14 Thread Matt Cheah
We opened a thread for voting yesterday, so please participate! -Matt Cheah From: Yue Li Date: Thursday, June 13, 2019 at 7:22 PM To: Saisai Shao , Imran Rashid Cc: Matt Cheah , "Yifei Huang (PD)" , Mridul Muralidharan , Bo Yang , Ilan Filonenko , Imran Rashid , Justin Uan

[VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-13 Thread Matt Cheah
or not this proposal is agreeable to you. Thanks! -Matt Cheah

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-02 Thread Matt Cheah
enchen Fan , Hyukjin Kwon , Russell Spitzer , Ryan Blue , Reynold Xin , Matt Cheah , Takeshi Yamamuro , Spark dev list Subject: Re: [Discuss] Follow ANSI SQL on table insertion Hi all, Let me explain a little bit on the proposal. By default, we follow the store assignment rule

Re: DataSourceV2 : Transactional Write support

2019-08-02 Thread Matt Cheah
specific semantics we don’t support in the V2 API. For example, one cannot commit multiple write operations in a single transaction right now. That would require changes to the DDL and a pretty substantial change to the design of Spark-SQL more broadly. -Matt Cheah From: Shiv Prashant

Re: Thoughts on Spark 3 release, or a preview release

2019-09-17 Thread Matt Cheah
the Spark 3 preview release specifically on SPARK-25299. -Matt Cheah From: Xiao Li Date: Tuesday, September 17, 2019 at 12:00 AM To: Erik Erlandson Cc: Sean Owen , dev Subject: Re: Thoughts on Spark 3 release, or a preview release https://issues.apache.org/jira/browse/SPARK-28264

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Matt Cheah
+1 as both a contributor and a user. From: John Zhuge Date: Thursday, September 12, 2019 at 4:15 PM To: Jungtaek Lim Cc: Jean Georges Perrin , Hyukjin Kwon , Dongjoon Hyun , dev Subject: Re: Thoughts on Spark 3 release, or a preview release +1 Like the idea as a user and a DSv2

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Matt Cheah
Sorry I meant the current behavior for V2, which fails the query compilation if the cast is not safe. Agreed that a separate discussion about overflow might be warranted. I’m surprised we don’t throw an error now, but it might be warranted to do so. -Matt Cheah From: Reynold Xin

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Matt Cheah
, or perhaps the behavior can be flagged by the destination writer at write time. -Matt Cheah From: Hyukjin Kwon Date: Monday, July 29, 2019 at 11:33 PM To: Wenchen Fan Cc: Russell Spitzer , Takeshi Yamamuro , Gengliang Wang , Ryan Blue , Spark dev list Subject: Re: [Discuss] Follow ANSI SQL

Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Matt Cheah
There might be some help from the staging table catalog as well. -Matt Cheah From: Wenchen Fan Date: Monday, August 5, 2019 at 7:40 PM To: Shiv Prashant Sood Cc: Ryan Blue , Jungtaek Lim , Spark Dev List Subject: Re: DataSourceV2 : Transactional Write support I agree with the temp

[SPARK-25299] A Discussion About Shuffle Metadata Tracking

2020-03-13 Thread Matt Cheah
ential viable options, so I’m looking forward to engaging in dialogue moving forward. Thanks! -Matt Cheah