Hey,
I am trying to get my hands on Spark and I want to run some Java code as
specified in the Spark documentation
(http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-java
Hi!
I tried setting spark.task.maxFailures to 1 (with this patch applied:
https://github.com/apache/incubator-spark/pull/245) and started a job.
After some time, I killed all the JVMs running on one of the two workers. I was
expecting the Spark job to fail; however, it re-fetched tasks to one of the two
workers
You can compress a CSV or tab-delimited file as well :)
You can specify the codec of your choice, say Snappy, when writing out.
That's what we do. You can also write out data as sequence files. RCFile
should also be possible given the flexibility of the Spark API, but we haven't
tried that.
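For anyone looking for a concrete starting point, here is a minimal sketch of writing compressed text output (someRdd, the output path, and the choice of GzipCodec are illustrative; this assumes a Spark version whose saveAsTextFile accepts a codec class):

    import org.apache.hadoop.io.compress.GzipCodec
    // write the RDD out as compressed text part files; swap in SnappyCodec if the
    // native Snappy libraries are available on the cluster
    someRdd.saveAsTextFile("hdfs:///out/compressed", classOf[GzipCodec])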
On Dec 7,
Any chance you could sketch out the Shark APIs that you use for this?
Matei's response suggests that the preferred API is coming in the next
release (i.e. RDDTable class in 0.8.1). Are you building Shark from the
latest in the repo and using that? Or have you figured out other API
calls
I'd like to use Spark as an analytical stack; the only difference is that I
would like to find the best way to connect it to a dataset that I'm actively
working on. Perhaps saying 'updates to an RDD' is a bit of a loaded term; I
don't need the 'resilient', just a distributed data set.
Right now, the
Hello experts,
I would like to read an LZO splittable compressed file into Spark.
I have followed the available material on the web on working with LZO
compressed data.
I am able to create the index file needed by Hadoop.
But I am unable to read the LZO file in Spark. I am using Spark 0.7.2.
I would
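The message is cut off in the archive, but for anyone hitting the same problem, the approach usually suggested for splittable LZO is to read through the new Hadoop API with the input format from the hadoop-lzo project. A hedged sketch (sc is an existing SparkContext, the path is illustrative, and the class names assume the hadoop-lzo jar and its native libraries are on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat   // from the hadoop-lzo project

    // read the indexed .lzo file as (byte offset, line) pairs and keep the text
    val lines = sc.newAPIHadoopFile("hdfs:///data/input.lzo",
        classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text],
        new Configuration())
      .map(_._2.toString)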
Kyle, many of your design goals are something we also want. Indeed it's
interesting you separate resilient from RDD, as I've suggested there
should be ways to boost performance if you're willing to give up some or
all of the R guarantees.
We haven't started looking into this yet due to other
Hi all,
I'm noticing some strange behavior when running mapPartitions. Pseudo code:
JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
myRDD.mapPartitions( func )
myRDD.count()
ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>,
List<Tuple2<Double, Double>>>>> tempRDD =
Neither map nor mapPartitions mutates an RDD -- if you establish an
immutable reference to an rdd (e.g., in Scala, val rdd = ...), the data
contained within that RDD will be the same regardless of any map or
mapPartitions transformations. However, if you re-assign the reference to
point to the
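To make the first point concrete, a tiny sketch (the variable names are just for illustration):

    val nums = sc.parallelize(1 to 5)
    val doubled = nums.map(_ * 2)            // a new RDD; nums itself is unchanged
    println(nums.collect().toList)           // List(1, 2, 3, 4, 5)
    println(doubled.collect().toList)        // List(2, 4, 6, 8, 10)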
That's right, with the minor clarification that Spark will only spill
cached data to disk if it was persisted with a StorageLevel that allows
spilling to disk (e.g. MEMORY_AND_DISK); otherwise, it will simply drop the
blocks and they'll have to be recomputed from lineage if they're accessed
again.
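In code, the difference is just the StorageLevel passed to persist; a minimal sketch (firstRdd and secondRdd are placeholders):

    import org.apache.spark.storage.StorageLevel
    // MEMORY_ONLY (the cache() default): blocks that don't fit are dropped and
    // recomputed from lineage if they are accessed again
    firstRdd.cache()
    // MEMORY_AND_DISK: blocks that don't fit in memory are spilled to disk instead
    secondRdd.persist(StorageLevel.MEMORY_AND_DISK)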
Yes, I understand that I lost the original reference to my RDD.
My question is regarding the new reference (made by tempRDD). I should
have two references to two different points in the lineage chain.
One is myRDD and the other is tempRDD. It seems that after running the
transformation referred
Spark doesn't perform defensive copying before passing cached objects to
your user-defined functions, so if you're caching mutable Java objects and
mutating them in a transformation, the effects of that mutation might
change the cached data and affect other RDDs. Is func2 mutating your
cached
I guess that's the answer then; I am mutating my cached objects.
I will probably need to create a new copy of those objects within the
transformation to avoid this.
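As a sketch of that fix (Counter is a made-up mutable class standing in for the cached BSON objects), copy inside the transformation and mutate the copy rather than the cached instance:

    class Counter(var n: Int) extends Serializable

    val cached = sc.parallelize(Seq(new Counter(1), new Counter(2))).cache()
    cached.count()                               // materialize the cached objects

    // mutate a fresh copy, leaving the cached objects untouched
    val bumped = cached.map { c =>
      val copy = new Counter(c.n)
      copy.n += 1
      copy
    }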
On 12/9/13 2:33 PM, Josh Rosen wrote:
Spark doesn't perform defensive copying before passing cached objects
to your
This is very likely due to memory issues. The problem is that each
reducer (partition of the groupBy) builds an in-memory table of that
partition. If you have very few partitions, this will fail, so the solution
is to simply increase the number of reducers. For example:
sc.parallelize(1 to
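The example above is truncated in the archive, but the gist is simply to pass an explicit partition count to the shuffle. A hedged sketch (the data and the count of 200 are arbitrary):

    import org.apache.spark.SparkContext._       // pair-RDD functions
    // more reduce partitions means a smaller in-memory table per reducer
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    val grouped = pairs.groupByKey(200)          // explicit number of reducers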
Whoops – should mention I'm using the scala-2.10 branch. Haven't tried it with the
Spark release working off of Scala 2.9.3.
From: Andrew Winings <mch...@palantir.com>
Reply-To: user@spark.incubator.apache.org
Will using a large number of reducers have any performance implications on
certain sized datasets? Ideally I'd like a single number of reducers that will
be optimal for data sets anywhere between 100 GB and several terabytes in size –
although I'm not sure how big the groups themselves will be
Thanks a lot for that. There are definitely a lot of subtleties that we need to
consider. We appreciate the thorough explanation!
-Matt Cheah
From: Aaron Davidson <ilike...@gmail.com>
Reply-To: user@spark.incubator.apache.org
I got a bit lost there; the link below still takes new users to the old
mailing list:
http://spark.incubator.apache.org/faq.html
The mailing-list navigation / header points to the right place. This is
just the one on the FAQ page.
--
Harry
Hi Matt,
The behavior for sequenceFile is there because we reuse the same Writable
object when reading elements from the file. This is definitely unintuitive, but
if you pass through each data item only once instead of caching it, it can be
more efficient (probably should be off by default
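The usual workaround, if you do want to cache, is to copy the values out of the reused Writables first; a sketch assuming an IntWritable/Text sequence file (path and types are illustrative):

    import org.apache.hadoop.io.{IntWritable, Text}
    // map the reused Writables to plain Scala values before caching
    val data = sc.sequenceFile("hdfs:///data/seq", classOf[IntWritable], classOf[Text])
      .map { case (k, v) => (k.get, v.toString) }
      .cache()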
It is actually in the JavaPairRDD class:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.api.java.JavaPairRDD
On Mon, Dec 9, 2013 at 10:42 PM, Matt Cheah mch...@palantir.com wrote:
Was this introduced recently? JavaRDD's function signatures don't seem
to take
I am thinking that if your data is sufficiently partitioned prior to
sortByKey(), say by virtue of a prior groupByKey() or reduceByKey()
call, then the sortByKey() that follows should have its number of tasks based
on that number of partitions.
Of course, setting the default parallelism will also work.
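In other words, either let the sort inherit the partitioning of the previous shuffle or pass the count explicitly; a sketch (pairs and the count of 200 are illustrative):

    import org.apache.spark.SparkContext._       // pair-RDD functions
    // the sort inherits the 200 partitions created by the earlier reduceByKey...
    val reduced = pairs.reduceByKey(_ + _, 200)
    // ...or the partition count can be passed to sortByKey directly
    val sorted = reduced.sortByKey(true, 200)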
That data size is sufficiently small for the cluster configuration that you
mention.
Are you doing the sort in local mode or on the master only? Is the default
parallelism system property being set prior to creating the SparkContext?
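For reference, in the 0.8-era configuration style that property is a plain JVM system property and has to be set before the context exists; a minimal sketch (master URL and app name are illustrative):

    // must be set before the SparkContext is constructed in order to take effect
    System.setProperty("spark.default.parallelism", "200")
    val sc = new org.apache.spark.SparkContext("spark://master:7077", "sortJob")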
On Mon, Dec 9, 2013 at 10:45 PM, Matt Cheah mch...@palantir.com wrote: