cached rdd in memory eviction

2014-02-24 Thread Koert Kuipers
i was under the impression that running jobs could not evict cached rdds from memory as long as the cached data stays below spark.storage.memoryFraction. however what i observe seems to indicate the opposite. did anything change? thanks! koert
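
For reference, a minimal sketch of the setup in question (Spark 0.9-era API; the local master and the 0.6 value are illustrative, 0.6 being the documented default for spark.storage.memoryFraction):

    import org.apache.spark.{SparkConf, SparkContext}

    // cap the fraction of executor memory used for caching; cached blocks
    // should only be dropped (LRU) once this storage region fills up
    val conf = new SparkConf()
      .setMaster("local[2]") // illustrative master
      .setAppName("cache-eviction-test")
      .set("spark.storage.memoryFraction", "0.6")
    val sc = new SparkContext(conf)

    // cache an RDD and materialize it in the block manager
    val cached = sc.parallelize(1 to 1000000).cache()
    cached.count()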

Re: Shared access to RDD

2014-02-17 Thread Koert Kuipers
it is possible to run multiple queries using a shared SparkContext (which holds the shared RDD). however this is not easily available in spark-shell i believe. alternatively tachyon can be used to share (serialized) RDDs. On Mon, Feb 17, 2014 at 11:41 AM, David Thomas dt5434...@gmail.com wrote:
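
Spark's scheduler is thread-safe, so a shared SparkContext can serve several queries at once. A sketch under that assumption (the master, input path, and queries are hypothetical):

    import org.apache.spark.SparkContext
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val sc = new SparkContext("local[4]", "shared-rdd") // illustrative master
    val shared = sc.textFile("hdfs:///data/events").cache() // hypothetical path

    // two queries submitted concurrently from separate threads,
    // both reading the same cached RDD
    val q1 = Future { shared.filter(_.contains("ERROR")).count() }
    val q2 = Future { shared.map(_.length.toLong).reduce(_ + _) }

    println(Await.result(q1, Duration.Inf))
    println(Await.result(q2, Duration.Inf))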

default parallelism in trunk

2014-02-01 Thread Koert Kuipers
i just managed to upgrade my 0.9-SNAPSHOT from the last scala 2.9.x version to the latest. everything seems good except that my default parallelism is now set to 2 for jobs instead of some smart number based on the number of cores (i think that is what it used to do). is this change on purpose?
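
Until that's settled, a workaround sketch is to pin spark.default.parallelism explicitly (the value 16 and the input path are illustrative, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD implicits (needed in this era)

    val conf = new SparkConf()
      .setMaster("local[4]") // illustrative
      .setAppName("parallelism-test")
      .set("spark.default.parallelism", "16")
    val sc = new SparkContext(conf)

    // shuffle operations like reduceByKey fall back to
    // spark.default.parallelism when no partition count is given
    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)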

graphx merge for scala 2.9

2013-12-27 Thread Koert Kuipers
since we are still on scala 2.9.x and trunk migrated to 2.10.x i hope graphx will get merged into the 0.8.x series at some point, and not just 0.9.x (which is now scala 2.10), since that would make it hard for us to use in the near future. best, koert

Re: writing to HDFS with a given username

2013-12-13 Thread Koert Kuipers
the master branch. These patches make it possible to access hdfs as the user who starts the Spark application, not the one who starts the Spark service. Thanks Jerry *From:* Koert Kuipers [mailto:ko...@tresata.com] *Sent:* Friday, December 13, 2013 8:39 AM *To:* user@spark.incubator.apache.org
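
Independent of those patches, a common workaround on clusters without Kerberos is Hadoop's HADOOP_USER_NAME override: with simple authentication, UserGroupInformation honors it from the environment (and, depending on the Hadoop version, as a JVM system property). A hedged sketch, with a hypothetical username and path; on a real cluster the variable must also be visible to the executors, not just the driver, which is why this sketch uses a local master:

    import org.apache.spark.SparkContext

    // with simple (non-Kerberos) auth this overrides the user HDFS sees;
    // set it as an environment variable when launching if your Hadoop
    // version does not read it as a system property
    System.setProperty("HADOOP_USER_NAME", "koert") // hypothetical user

    val sc = new SparkContext("local[2]", "hdfs-as-user") // illustrative
    sc.parallelize(Seq("a", "b", "c"))
      .saveAsTextFile("hdfs:///user/koert/out") // hypothetical path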

Re: writing to HDFS with a given username

2013-12-12 Thread Koert Kuipers
Hey Philip, how do you get spark to write to hdfs with your user name? When i use spark it writes to hdfs as the user that runs the spark services... i wish it read and wrote as me. On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote: When I call

Re: 0.9-SNAPSHOT StageInfo

2013-11-29 Thread Koert Kuipers
message as to why the calculation failed (as opposed to: fetch failed more than 4 times). On Fri, Nov 29, 2013 at 3:09 PM, Koert Kuipers ko...@tresata.com wrote: in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no longer accessible. however the stage contains the rdd, which

Re: Does spark RDD has a partitionedByKey

2013-11-16 Thread Koert Kuipers
we use partitionBy a lot to keep multiple datasets co-partitioned before caching. it works well. On Sat, Nov 16, 2013 at 5:10 AM, guojc guoj...@gmail.com wrote: After looking at the api more carefully, I just found I overlooked the partitionBy function on PairRDDFunction. It's the function

Re: Does spark RDD has a partitionedByKey

2013-11-16 Thread Koert Kuipers
in fact co-partitioning was one of the main reasons we started using spark. in map-reduce it's a giant pain to implement. On Sat, Nov 16, 2013 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: we use partitionBy a lot to keep multiple datasets co-partitioned before caching. it works well
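
To show what co-partitioning buys, a sketch with hypothetical data: once two pair RDDs share the same partitioner, a join between them is a narrow dependency, so neither side is shuffled:

    import org.apache.spark.{HashPartitioner, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD implicits (needed in this era)

    val sc = new SparkContext("local[4]", "copartition") // illustrative master

    // hypothetical key-value datasets
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
    val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (3, "home")))

    // partition both with the same partitioner, then cache: equal keys
    // now live in the same partition across both datasets
    val p = new HashPartitioner(8)
    val usersP  = users.partitionBy(p).cache()
    val visitsP = visits.partitionBy(p).cache()

    // both inputs share a partitioner, so this join needs no shuffle
    usersP.join(visitsP).collect().foreach(println)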

Re: compare/contrast Spark with Cascading

2013-10-29 Thread Koert Kuipers
scrapco...@gmail.com wrote: Hey Koert, Can you give me steps to reproduce this? On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers ko...@tresata.com wrote: Matei, We have some jobs where even the input for a single key in a groupBy would not fit in the task's memory. We rely on mapred to stream
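
When the per-key computation is really an aggregation, one mitigation (a sketch, not a fix for the general streaming case) is reduceByKey, which combines values incrementally with map-side combining instead of materializing every value of a key in one task the way groupByKey does:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair RDD implicits (needed in this era)

    val sc = new SparkContext("local[2]", "agg-not-group") // illustrative

    // hypothetical skew: a million values under one key
    val pairs = sc.parallelize(1 to 1000000).map(_ => ("hot-key", 1L))

    // pairs.groupByKey().mapValues(_.sum) would pull all values for
    // "hot-key" into a single task's memory before summing

    // reduceByKey combines incrementally, so the full value list for a
    // key is never held at once
    pairs.reduceByKey(_ + _).collect().foreach(println)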

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
it. Matei On Oct 28, 2013, at 5:32 PM, Koert Kuipers ko...@tresata.com wrote: no problem :) i am actually not familiar with what oscar has said on this. can you share or point me to the conversation thread? it is my opinion based on the little experimenting i have done. but i am willing

spark 0.8

2013-10-17 Thread Koert Kuipers
after upgrading from spark 0.7 to spark 0.8 i can no longer access any files on HDFS. i see the error below. any ideas? i am running spark standalone on a cluster that also has CDH4.3.0 and rebuilt spark accordingly. the jars in lib_managed look good to me. i noticed similar errors in the

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
now it works. On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers ko...@tresata.com wrote: after upgrading from spark 0.7 to spark 0.8 i can no longer access any files on HDFS. i see the error below. any ideas? i am running spark standalone on a cluster that also has CDH4.3.0 and rebuilt spark

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
for your version of Hadoop. See http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala for example. Matei On Oct 17, 2013, at 4:38 PM, Koert Kuipers ko...@tresata.com wrote: i got the job a little further along by also setting this: System.setProperty

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
i have my spark and hadoop related dependencies marked as provided for my spark job. this used to work with previous versions. are these now supposed to be compile/runtime/default dependencies? On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers ko...@tresata.com wrote: yes i did that and i can see
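
For comparison, a build.sbt sketch for the setup described in this thread (the versions are assumptions matching a CDH 4.3.0 cluster and Spark 0.8 on scala 2.9.3; adjust to your build). With provided scope the cluster installation must supply these jars at runtime:

    // build.sbt (sketch)
    name := "my-spark-job"

    scalaVersion := "2.9.3"

    resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

    libraryDependencies ++= Seq(
      // provided: the cluster's spark install supplies this at runtime
      "org.apache.spark" %% "spark-core" % "0.8.0-incubating" % "provided",
      // hadoop-client pinned to the cluster's hdfs version, per the
      // quick-start note referenced above
      "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.3.0" % "provided"
    )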