Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Cool. Using Ambari to monitor and scale up/down the cluster sounds promising. Thanks for the pointer! Mingyu From: Deepak Sharma <deepakmc...@gmail.com> Date: Monday, December 14, 2015 at 1:53 AM To: cs user <acldstk...@gmail.com> Cc: Mingyu Kim <m...@palantir.com>,

Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Hi all, Has anyone tried out autoscaling a Spark YARN cluster on a public cloud (e.g. EC2) based on workload? To be clear, I'm interested in scaling the cluster itself up and down by adding and removing YARN nodes based on the cluster resource utilization (e.g. # of applications queued, # of

Re: compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim
Hi all, I filed https://issues.apache.org/jira/browse/SPARK-11081. Since Jersey’s surface area is relatively small and seems to be only used for Spark UI and JSON API, shading the dependency might make sense similar to what’s done for Jetty dependencies at

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Mingyu Kim
with hadoop 2.4. Thanks. Darin. From: Aaron Davidson ilike...@gmail.com To: Andrew Ash and...@andrewash.com Cc: Josh Rosen rosenvi...@gmail.com; Mingyu Kim m...@palantir.com; user@spark.apache.org user@spark.apache.org; Aaron Davidson aa...@databricks.com Sent: Saturday

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Mingyu Kim
I didn’t get any response. It’d be really appreciated if anyone using a special OutputCommitter for S3 can comment on this! Thanks, Mingyu From: Mingyu Kim m...@palantir.com Date: Monday, February 16, 2015 at 1:15 AM To: user@spark.apache.org

Which OutputCommitter to use for S3?

2015-02-16 Thread Mingyu Kim
Hi all, The default OutputCommitter used by RDD, which is FileOutputCommitter, seems to require moving files at the commit step, which is not a constant-time operation in S3, as discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E. People

Re: How does Spark speculation prevent duplicated work?

2014-07-16 Thread Mingyu Kim
, writing to HDFS / S3 is idempotent. Now this logic is already implemented within Hadoop's MapReduce logic, and Spark just uses it directly. TD On Tue, Jul 15, 2014 at 2:33 PM, Mingyu Kim m...@palantir.com wrote: Thanks for the explanation, guys. I looked

How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
Hi all, I was curious about the details of Spark speculation. So, my understanding is that, when “speculated” tasks are newly scheduled on other machines, the original tasks are still running until the entire stage completes. This seems to leave some room for duplicated work because some spark

JavaRDD.mapToPair throws NPE

2014-06-24 Thread Mingyu Kim
Hi all, I'm trying to use JavaRDD.mapToPair(), but it fails with NPE on the executor. The PairFunction used in the call is null for some reason. Any comments/help would be appreciated! My setup is, * Java 7 * Spark 1.0.0 * Hadoop 2.0.0-mr1-cdh4.6.0 Here's the code snippet. import

1.0.1 release plan

2014-06-19 Thread Mingyu Kim
Hi all, Is there any plan for 1.0.1 release? Mingyu

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
29, 2014 at 10:44 PM, Mingyu Kim m...@palantir.com wrote: Thanks for the quick response! To better understand it, the reason sorted RDD has a well-defined ordering is because sortedRDD.getPartitions() returns the partitions in the right order and each partition internally is properly sorted. So

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
this, it wouldn't violate the contract of union. AFAIK the only guarantee is that the resulting RDD will contain all elements. - Patrick On Tue, Apr 29, 2014 at 11:26 PM, Mingyu Kim m...@palantir.com wrote: Yes, that’s what I meant. Sure, the numbers might not be actually sorted, but the order of rows

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
that the API doesn't do. On Wed, Apr 30, 2014 at 11:13 AM, Mingyu Kim m...@palantir.com wrote: Okay, that makes sense. It’d be great if this can be better documented at some point, because the only way to find out about the resulting RDD row order is by looking at the code. Thanks

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
Hi Patrick, I'm a little confused about your comment that RDDs are not ordered. As far as I know, RDDs keep a list of partitions that are ordered and this is why I can call RDD.take() and get the same first k rows every time I call it and RDD.take() returns the same entries as RDD.map(…).take()