Cool. Using Ambari to monitor and scale up/down the cluster sounds
promising. Thanks for the pointer!
Mingyu
From: Deepak Sharma <deepakmc...@gmail.com>
Date: Monday, December 14, 2015 at 1:53 AM
To: cs user <acldstk...@gmail.com>
Cc: Mingyu Kim <m...@palantir.com>, &quo
Hi all,
Has anyone tried out autoscaling Spark YARN cluster on a public cloud (e.g.
EC2) based on workload? To be clear, I'm interested in scaling the cluster
itself up and down by adding and removing YARN nodes based on the cluster
resource utilization (e.g. # of applications queued, # of
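A minimal sketch of the monitoring half of such an autoscaler, assuming a
hypothetical ResourceManager at rm-host:8088. The /ws/v1/cluster/metrics
endpoint and its appsPending field come from the YARN ResourceManager REST
API; the scaling policy itself and the node add/remove calls (e.g. via the
EC2 API) are left out.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ClusterMetricsProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical RM host; /ws/v1/cluster/metrics is the standard
        // YARN ResourceManager REST endpoint for cluster-wide metrics.
        URL url = new URL("http://rm-host:8088/ws/v1/cluster/metrics");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line);
        }
        in.close();
        // Crude extraction of "appsPending"; a real autoscaler would use a
        // JSON library and feed this into a policy that adds or removes
        // NodeManagers.
        String json = body.toString();
        int idx = json.indexOf("\"appsPending\":");
        if (idx >= 0) {
            int start = idx + "\"appsPending\":".length();
            int end = start;
            while (end < json.length() && Character.isDigit(json.charAt(end))) {
                end++;
            }
            System.out.println("Pending applications: " + json.substring(start, end));
        }
    }
}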
Hi all,
I filed https://issues.apache.org/jira/browse/SPARK-11081. Since Jersey’s
surface area is relatively small and seems to be used only for the Spark UI and
JSON API, shading the dependency might make sense, similar to what’s done for
the Jetty dependencies at
with Hadoop 2.4.
Thanks.
Darin.
From: Aaron Davidson ilike...@gmail.com
To: Andrew Ash and...@andrewash.com
Cc: Josh Rosen rosenvi...@gmail.com; Mingyu Kim m...@palantir.com;
user@spark.apache.org user@spark.apache.org; Aaron Davidson
aa...@databricks.com
Sent: Saturday
I didn't get any response. It'd be really appreciated if anyone using a special
OutputCommitter for S3 could comment on this!
Thanks,
Mingyu
From: Mingyu Kim m...@palantir.com
Date: Monday, February 16, 2015 at 1:15 AM
To: user@spark.apache.org
Hi all,
The default OutputCommitter used by RDD, which is FileOutputCommitter, seems to
require moving files at the commit step, which is not a constant-time operation
in S3, as discussed in
http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E.
People
, writing to HDFS / S3 is idempotent.
Now this logic is already implemented within Hadoop's MapReduce logic, and
Spark just uses it directly.
TD
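For reference, one workaround discussed in these threads is a "direct"
committer: tasks write straight to the final S3 location, so there is
nothing to move at commit time. Here is a minimal sketch against the old
mapred API (the one RDD.saveAsHadoopFile goes through); the class name is
made up, and, as TD notes above, this is only safe when the writes are
idempotent.

import java.io.IOException;
import org.apache.hadoop.mapred.JobContext;
import org.apache.hadoop.mapred.OutputCommitter;
import org.apache.hadoop.mapred.TaskAttemptContext;

public class DirectOutputCommitter extends OutputCommitter {
    @Override public void setupJob(JobContext jobContext) throws IOException {}
    @Override public void setupTask(TaskAttemptContext taskContext) throws IOException {}
    @Override public boolean needsTaskCommit(TaskAttemptContext taskContext)
            throws IOException {
        return false; // nothing is staged, so there is nothing to commit
    }
    @Override public void commitTask(TaskAttemptContext taskContext) throws IOException {}
    @Override public void abortTask(TaskAttemptContext taskContext) throws IOException {}
}

It would be wired in on the JobConf before saving, e.g.
jobConf.setOutputCommitter(DirectOutputCommitter.class).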
On Tue, Jul 15, 2014 at 2:33 PM, Mingyu Kim m...@palantir.com wrote:
Thanks for the explanation, guys.
I looked
Hi all,
I was curious about the details of Spark speculation. So, my understanding
is that, when "speculated" tasks are newly scheduled on other machines, the
original tasks are still running until the entire stage completes. This
seems to leave some room for duplicated work because some spark
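For context on trying this out: speculation is off by default and is
controlled by a few settings. A minimal sketch of enabling it in Spark 1.x;
the values here are illustrative, not recommendations.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SpeculationConfig {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("speculation-example")
                // re-launch slow task attempts on other executors
                .set("spark.speculation", "true")
                // how often (ms) to check for stragglers
                .set("spark.speculation.interval", "1000")
                // "slow" means more than 1.5x the median task time
                .set("spark.speculation.multiplier", "1.5")
                // only speculate once 75% of a stage's tasks have finished
                .set("spark.speculation.quantile", "0.75");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code; whichever attempt of a speculated task finishes
        // first wins, and the other attempt is killed.
        sc.stop();
    }
}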
Hi all,
I'm trying to use JavaRDD.mapToPair(), but it fails with an NPE on the
executor. The PairFunction used in the call is null for some reason. Any
comments/help would be appreciated!
My setup is,
* Java 7
* Spark 1.0.0
* Hadoop 2.0.0-mr1-cdh4.6.0
Here's the code snippet.
import
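The snippet is truncated in the archive. For comparison, here is a minimal
self-contained mapToPair() example for the same setup (Spark 1.0 Java API);
the class and data are made up. A static nested function class is one common
way to sidestep serialization problems, which can surface on executors as a
null function.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

public class MapToPairExample {
    // Static nested class: it does not capture the enclosing instance, so
    // the enclosing class's serializability cannot break the closure.
    static class LengthPair implements PairFunction<String, String, Integer> {
        @Override
        public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, s.length());
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("mapToPair-example").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "bb", "ccc"));
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new LengthPair());
        System.out.println(pairs.collect()); // [(a,1), (bb,2), (ccc,3)]
        sc.stop();
    }
}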
Hi all,
Is there any plan for 1.0.1 release?
Mingyu
29, 2014 at 10:44 PM, Mingyu Kim m...@palantir.com wrote:
Thanks for the quick response!
To better understand it, the reason a sorted RDD has a well-defined
ordering
is because sortedRDD.getPartitions() returns the partitions in the right
order and each partition internally is properly sorted. So
this, it
wouldn't violate the contract of union
AFAIK the only guarantee is that the resulting RDD will contain all elements.
- Patrick
On Tue, Apr 29, 2014 at 11:26 PM, Mingyu Kim m...@palantir.com wrote:
Yes, that's what I meant. Sure, the numbers might not actually be
sorted,
but the order of rows
that the API doesn't do.
On Wed, Apr 30, 2014 at 11:13 AM, Mingyu Kim m...@palantir.com wrote:
Okay, that makes sense. It'd be great if this could be better documented at
some point, because the only way to find out about the resulting RDD row
order is by looking at the code.
Thanks
Hi Patrick,
I'm a little confused about your comment that RDDs are not ordered. As far
as I know, RDDs keep a list of partitions that are ordered, and this is why I
can call RDD.take() and get the same first k rows every time I call it and
RDD.take() returns the same entries as RDD.map().take()
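A small runnable illustration of the behavior being discussed, using the
Spark 1.x Java API (names and data are made up): sortByKey() gives the
partitions a well-defined global order, and take() walks the partitions in
that order, so repeated calls return the same first k elements.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TakeOrderingExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("take-ordering").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaPairRDD<Integer, String> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<Integer, String>(3, "c"),
                new Tuple2<Integer, String>(1, "a"),
                new Tuple2<Integer, String>(2, "b")), 2);
        JavaPairRDD<Integer, String> sorted = pairs.sortByKey(true);
        System.out.println(sorted.take(2)); // [(1,a), (2,b)] on every call
        sc.stop();
    }
}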