Building standalone app in Java for Spark

2013-12-09 Thread Ravi Hemnani
Hey, I am trying to get my hands on Spark and I want to run Java code, and as specified in the Spark documentation (http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-java

Re: Resubmision due to a fetch failure

2013-12-09 Thread Grega Kešpret
Hi! I tried setting spark.task.maxFailures to 1 (having this patch https://github.com/apache/incubator-spark/pull/245) and started a job. After some time, I killed all JVMs running on one of the two workers. I was expecting the Spark job to fail; however, it re-fetched tasks to one of the two workers
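For context, pre-1.0 Spark read most configuration from Java system properties set before the SparkContext was created. A minimal Scala sketch of setting this property (only the property name spark.task.maxFailures comes from the thread; the master URL, app name, and job are placeholders):

    import org.apache.spark.SparkContext

    object MaxFailuresSketch {
      def main(args: Array[String]): Unit = {
        // Pre-1.0 Spark reads settings from Java system properties,
        // so this must be set before the SparkContext is constructed.
        System.setProperty("spark.task.maxFailures", "1")

        // "local[2]" and the app name are placeholders for this sketch.
        val sc = new SparkContext("local[2]", "max-failures-test")

        // With maxFailures = 1, a failing task should abort the job
        // after a single attempt instead of being retried.
        sc.parallelize(1 to 100).map(_ * 2).count()

        sc.stop()
      }
    }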

Re: data storage formats

2013-12-09 Thread Ashish Rangole
You can compress a CSV or tab-delimited file as well :) You can specify the codec of your choice, say Snappy, when writing out. That's what we do. You can also write out data as sequence files. RCFile should also be possible given the flexibility of the Spark API, but we haven't tried that. On Dec 7,
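A Scala sketch of the two approaches mentioned above, assuming an existing SparkContext; the paths, the key layout, and the Snappy codec class are assumptions, and the exact save overloads available depend on your Spark and Hadoop versions:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._            // pair-RDD functions such as saveAsSequenceFile on older Spark
    import org.apache.hadoop.io.compress.SnappyCodec  // requires Snappy native libraries on the cluster

    val sc = new SparkContext("local[2]", "storage-formats")

    val lines = sc.textFile("hdfs:///data/input.csv")

    // 1) Plain text (CSV/TSV) output, compressed with a codec of your choice.
    lines.saveAsTextFile("hdfs:///data/output-csv", classOf[SnappyCodec])

    // 2) Key/value output written as a Hadoop SequenceFile.
    lines.map(line => (line.split(",")(0), line))
         .saveAsSequenceFile("hdfs:///data/output-seq")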

Re: Writing an RDD to Hive

2013-12-09 Thread Philip Ogren
Any chance you could sketch out the Shark APIs that you use for this? Matei's response suggests that the preferred API is coming in the next release (i.e. RDDTable class in 0.8.1). Are you building Shark from the latest in the repo and using that? Or have you figured out other API calls

Re: Incremental Updates to an RDD

2013-12-09 Thread Kyle Ellrott
I'd like to use Spark as an analytical stack; the only difference is that I would like to find the best way to connect it to a dataset that I'm actively working on. Perhaps saying 'updates to an RDD' is a bit of a loaded term; I don't need the 'resilient', just a distributed data set. Right now, the

reading LZO compressed file in spark

2013-12-09 Thread Rajeev Srivastava
Hello experts, I would like to read an LZO splittable compressed file into Spark. I have followed the available material on the web on working with LZO compressed data. I am able to create the index file needed by Hadoop, but I am unable to read the LZO file in Spark. I use Spark 0.7.2. I would
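For reference, a sketch of one common approach, reading the indexed LZO data through the third-party hadoop-lzo input format via newAPIHadoopFile. The LzoTextInputFormat class and the path are assumptions, hadoop-lzo must be on the classpath, and the imports below use the org.apache.spark package of later releases (Spark 0.7.x used the spark.* package), so treat this only as a starting point:

    import org.apache.spark.SparkContext
    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat   // from the hadoop-lzo package (assumed to be on the classpath)

    val sc = new SparkContext("local[2]", "lzo-read")

    // newAPIHadoopFile lets Spark use the splittable LZO input format,
    // which honours the .index files generated for the .lzo data.
    val lzoLines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs:///data/input.lzo")
      .map { case (_, text) => text.toString }

    println(lzoLines.count())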

Re: Incremental Updates to an RDD

2013-12-09 Thread Christopher Nguyen
Kyle, many of your design goals are something we also want. Indeed it's interesting you separate resilient from RDD, as I've suggested there should be ways to boost performance if you're willing to give up some or all of the R guarantees. We haven't started looking into this yet due to other

JavaPairRDD mapPartitions side effects

2013-12-09 Thread Yadid Ayzenberg
Hi all, I'm noticing some strange behavior when running mapPartitions. Pseudo code: JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD = myRDD.mapPartitions(func); myRDD.count(); ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>, List<Tuple2<Double, Double>>>>> tempRDD =

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Mark Hamstra
Neither map nor mapPartitions mutates an RDD -- if you establish an immutable reference to an RDD (e.g., in Scala, val rdd = ...), the data contained within that RDD will be the same regardless of any map or mapPartitions transformations. However, if you re-assign the reference to point to the
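A small Scala illustration of this point, assuming an existing SparkContext sc (variable names are made up for the sketch):

    val original = sc.parallelize(Seq(1, 2, 3))   // immutable reference to the original RDD
    val doubled  = original.map(_ * 2)            // map returns a new RDD; `original` is untouched

    // `original` still yields 1, 2, 3; `doubled` yields 2, 4, 6.
    // Only re-assigning a var (e.g. var rdd = ...; rdd = rdd.map(...))
    // makes the name point at a different RDD in the lineage.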

Re: cache()ing local variables?

2013-12-09 Thread Josh Rosen
That's right, with the minor clarification that Spark will only spill cached data to disk if it was persisted with a StorageLevel that allows spilling to disk (e.g. MEM_AND_DISK); otherwise, it will simply drop the blocks and they'll have to be recomputed from lineage if they're accessed again.
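For reference, a sketch of the two persistence choices being contrasted, assuming an existing SparkContext sc (path and computation are placeholders):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///data/input").map(_.length)

    // MEMORY_ONLY (what cache() uses): blocks that don't fit in memory are
    // dropped and recomputed from lineage if they are accessed again.
    val inMemOnly = data.persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: blocks that don't fit in memory are spilled to disk
    // instead of being dropped.
    val memAndDisk = data.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)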

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Yadid Ayzenberg
Yes, I understand that I lost the original reference to my RDD. My question is regarding the new reference (made by tempRDD). I should have 2 references to two different points in the lineage chain. One is myRDD and the other is tempRDD. It seems that after running the transformation referred

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Josh Rosen
Spark doesn't perform defensive copying before passing cached objects to your user-defined functions, so if you're caching mutable Java objects and mutating them in a transformation, the effects of that mutation might change the cached data and affect other RDDs. Is func2 mutating your cached

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Yadid Ayzenberg
I guess that's the answer then, I am mutating my cached objects. I will probably need to create a new copy of those objects within the transformation to avoid this. On 12/9/13 2:33 PM, Josh Rosen wrote: Spark doesn't perform defensive copying before passing cached objects to your
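A Scala sketch of the fix being described, i.e. copying cached objects inside the transformation instead of mutating them in place; shown in Scala for brevity, with a hypothetical Record class standing in for the BSON objects, and assuming an existing SparkContext sc:

    // Hypothetical mutable value type standing in for the cached BSON objects.
    case class Record(var score: Double)

    val cached = sc.parallelize(Seq(Record(1.0), Record(2.0))).cache()

    // Problematic: mutates the objects that live in the cache, so later reads
    // of `cached` (and anything derived from it) see the modified values.
    val mutated = cached.map { r => r.score += 10; r }

    // Safer: build a fresh copy inside the transformation and leave the
    // cached objects untouched.
    val copied = cached.map(r => r.copy(score = r.score + 10))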

Re: groupBy() with really big groups fails

2013-12-09 Thread Aaron Davidson
This is very likely due to memory issues. The problem is that each reducer (partition of the groupBy) builds an in-memory table of that partition. If you have very few partitions, this will fail, so the solution is to simply increase the number of reducers. For example: sc.parallelize(1 to
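A sketch of the suggested fix, passing an explicit partition count to groupBy, assuming an existing SparkContext sc (the data size, key function, and partition count are placeholders):

    // Few partitions: each reducer builds an in-memory table for its whole
    // partition of groups, which can fail when the groups are very large.
    val coarse = sc.parallelize(1 to 10000000).groupBy((i: Int) => i % 10)

    // More reducers: the same groups spread across many smaller tasks,
    // so each in-memory table is smaller.
    val finer = sc.parallelize(1 to 10000000).groupBy((i: Int) => i % 10, 200)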

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
Whoops – should mention I'm using the scala-2.10 branch. Haven't tried with the Spark release working off of scala-2.9.3. From: Andrew Winings mch...@palantir.com Reply-To: user@spark.incubator.apache.org

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
Will using a large number of reducers have any performance implications on certain-sized datasets? Ideally I'd like a single number of reducers that will be optimal for data sets anywhere from 100 GB to several terabytes in size – although I'm not sure how big the groups themselves will be

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
Thanks a lot for that. There are definitely a lot of subtleties that we need to consider. We appreciate the thorough explanation! -Matt Cheah From: Aaron Davidson ilike...@gmail.com Reply-To: user@spark.incubator.apache.org

FYI: mailing list address on FAQ page still points to google

2013-12-09 Thread Harry B
I got lost there a bit; the link below still takes new users to the old mailing list: http://spark.incubator.apache.org/faq.html The mailing list navigation/header points to the right place. This is just the one on the FAQ page. -- Harry

Re: Hadoop RDD incorrect data

2013-12-09 Thread Matei Zaharia
Hi Matt, The behavior for sequenceFile is there because we reuse the same Writable object when reading elements from the file. This is definitely unintuitive, but if you pass through each data item only once instead of caching it, it can be more efficient (probably should be off by default
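A sketch of the usual workaround when caching data read via sequenceFile, i.e. converting the reused Writable objects into fresh plain values before caching; the path and the Text/IntWritable types are assumptions, and an existing SparkContext sc is assumed:

    import org.apache.hadoop.io.{IntWritable, Text}

    // Hadoop reuses the same Writable instances for every record, so caching
    // the records directly would leave the RDD full of references to one object.
    val raw = sc.sequenceFile("hdfs:///data/input.seq", classOf[Text], classOf[IntWritable])

    // Materialise each record into plain values before caching.
    val safe = raw.map { case (k, v) => (k.toString, v.get) }.cache()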

Re: JavaRDD, Specify number of tasks

2013-12-09 Thread Ashish Rangole
It is actually in JavaPairRDD class: http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.api.java.JavaPairRDD On Mon, Dec 9, 2013 at 10:42 PM, Matt Cheah mch...@palantir.com wrote: Was this introduced recently? JavaRDD's function signatures don't seem to take

Re: JavaRDD, Specify number of tasks

2013-12-09 Thread Ashish Rangole
I am thinking that if your data is sufficiently partitioned prior to sortByKey(), say by virtue of a prior groupByKey() or reduceByKey() call, the sortByKey() that follows should have its number of tasks based on that number of partitions. Of course, setting default parallelism will also work.
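A Scala sketch of the point above, assuming an existing SparkContext sc; the Java API exposes analogous overloads on JavaPairRDD, and the path, key extraction, and partition counts here are placeholders:

    import org.apache.spark.SparkContext._   // pair-RDD and ordered-RDD functions on older Spark versions

    val pairs = sc.textFile("hdfs:///data/input")
                  .map(line => (line.split("\t")(0), 1))

    // Giving reduceByKey an explicit partition count also determines how many
    // partitions the downstream sortByKey sees by default.
    val counts = pairs.reduceByKey(_ + _, 200)

    // sortByKey can also be given its own number of partitions explicitly.
    val sorted = counts.sortByKey(ascending = true, numPartitions = 200)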

Re: Hadoop RDD incorrect data

2013-12-09 Thread Ashish Rangole
That data size is sufficiently small for the cluster configuration that you mention. Are you doing the sort in local mode or on master only? Is the default parallelism system property being set prior to creating SparkContext? On Mon, Dec 9, 2013 at 10:45 PM, Matt Cheah mch...@palantir.com wrote:
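A sketch of setting the default-parallelism system property before the SparkContext is created, as asked about above; pre-1.0 Spark read this from Java system properties, and the master URL, app name, and value of 64 are placeholders:

    import org.apache.spark.SparkContext

    // Must be set before the SparkContext is constructed to take effect.
    System.setProperty("spark.default.parallelism", "64")

    val sc = new SparkContext("spark://master:7077", "sort-job")

    // Shuffle operations such as sortByKey now default to 64 partitions
    // unless an explicit partition count is passed.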