Re: problems with standalone cluster

2013-12-10 Thread Cesar Arevalo
Not sure if this will help you or if you've already tried it. But, maybe setting the log levels to debug will give you more information. Hope this helps. -Cesar On Tue, Dec 10, 2013 at 8:40 PM, Umar Javed wrote: > any help regarding this?...thx > > > On Tue, Nov 19, 2013 at 6:13 PM, Umar Jave
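For anyone wanting to try this, one way to raise the log level (a minimal sketch, assuming the stock log4j setup Spark ships with; editing conf/log4j.properties on each node is the other option) is to do it from the driver before creating the SparkContext:

    import org.apache.log4j.{Level, Logger}

    // Turn on DEBUG for everything; expect very verbose output.
    Logger.getRootLogger.setLevel(Level.DEBUG)

    // Or limit it to Spark's own classes to keep the noise down.
    Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)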

I need some help

2013-12-10 Thread leosand...@gmail.com
I have deployed two Spark clusters. The first is a simple standalone cluster, which is working well (sbt/sbt assembly). But in the second cluster I built Spark against Hadoop 2.0.0-cdh4.2.1, and there seems to be a problem when I start the master! (SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt

Re: problems with standalone cluster

2013-12-10 Thread Umar Javed
any help regarding this?...thx On Tue, Nov 19, 2013 at 6:13 PM, Umar Javed wrote: > I have a scala script that I'm trying to run on a Spark standalone cluster > with just one worker (existing on the master node). But the application > just hangs. Here's the worker log output at the time of star

Re: Writing an RDD to Hive

2013-12-10 Thread Philip Ogren
I uncovered a fairly simple solution that I thought I would share for the curious. Hive provides a JDBC driver/client which can be used to execute Hive statements (in my case to drop and create table
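For the curious, the JDBC route looks roughly like the sketch below. The host, port, table name, and HiveServer1-era driver class are assumptions for illustration, not details from the original post; HiveServer2 would use org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL instead.

    import java.sql.DriverManager

    // Register the Hive JDBC driver, then issue plain HiveQL statements.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive://hive-host:10000/default", "", "")
    val stmt = conn.createStatement()
    stmt.execute("DROP TABLE IF EXISTS my_table")                    // hypothetical table
    stmt.execute("CREATE TABLE my_table (id INT, value STRING)")
    stmt.close()
    conn.close()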

Spark hangs on bad Mesos slave

2013-12-10 Thread Gary Malouf
Hi guys, For reference, we are on a master build of Spark from November 19 and Mesos 0.13. Periodically, we run into an issue where one of our Mesos slaves takes some tasks from a Spark query and, according to the Mesos UI, they are stuck in 'STAGING'. This ends up blocking the query from running

Re: Constant out of memory issues

2013-12-10 Thread Patrick Wendell
Spark probably needs more than 1GB of heap space to function correctly. What happens if you give the workers more memory? - Patrick On Tue, Dec 10, 2013 at 2:42 PM, learner1014 all wrote: > > Data is in hdfs, running 2 workers with 1 GB memory > datafile1 is ~9KB and datafile2 is ~216MB. Cant ge
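For reference, in Spark of this vintage the executor heap can be raised from the driver before the SparkContext is created (a sketch; 4g is just an example value, and SPARK_WORKER_MEMORY on the workers is the other relevant knob):

    import org.apache.spark.SparkContext

    // Must be set before the SparkContext is constructed.
    System.setProperty("spark.executor.memory", "4g")   // example value, not a recommendation
    val sc = new SparkContext("spark://master:7077", "MyApp")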

Constant out of memory issues

2013-12-10 Thread learner1014 all
Data is in HDFS, running 2 workers with 1 GB of memory. datafile1 is ~9KB and datafile2 is ~216MB. Can't get it to run at all... Tried various settings for the number of tasks, all the way from 2 to 1024. Has anyone else seen similar issues? import org.apache.spark.SparkContext import org.apache

Re: reading LZO compressed file in spark

2013-12-10 Thread Stephen Haberman
> System.setProperty("spark.io.compression.codec", > "com.hadoop.compression.lzo.LzopCodec") This spark.io.compression.codec is a completely different setting than the codecs that are used for reading/writing from HDFS. (It is for compressing Spark's internal/non-HDFS intermediate output.) > Hop
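On the HDFS side, reading a splittable LZO file is usually a matter of pointing newAPIHadoopFile at the hadoop-lzo input format rather than touching spark.io.compression.codec. A minimal sketch, assuming the twitter/hadoop-lzo jar and its native libraries are on the cluster classpath and that sc is the SparkContext:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // Records come back as (byte offset, line); the companion .lzo.index file,
    // if present, is what lets the file be split across tasks.
    val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
      "hdfs:///path/to/file.lzo")        // illustrative path
      .map(_._2.toString)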

Re: Incremental Updates to an RDD

2013-12-10 Thread Christopher Nguyen
Wes, it depends on what you mean by "sliding window" as related to "RDD": 1. Some operation over multiple rows of data within a single, large RDD, for which the operations are required to be temporally sequential. This may be the case where you're computing a running average over historic
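As an illustration of case 1, a running average might be sketched like this (assumptions: the data is (timestamp, value) pairs, and a window never straddles a partition boundary, which a real implementation would have to handle):

    // Hypothetical input; in practice this would come from HDFS or similar.
    val timeSeries = sc.parallelize(Seq((1L, 1.0), (2L, 2.0), (3L, 3.0), (4L, 4.0)))
    val window = 3
    val runningAvg = timeSeries
      .sortByKey()
      .mapPartitions { iter =>
        iter.sliding(window).map { w =>
          (w.last._1, w.map(_._2).sum / w.size)   // average of the last `window` values
        }
      }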

Running Spark jar on EC2

2013-12-10 Thread Jeff Higgens
I'm having trouble running my Spark program as a "fat jar" on EC2. This is the process I'm using: (1) spark-ec2 script to launch cluster (2) ssh to master, install sbt and git clone my project's source code (3) update source to reference correct master and jar (4) sbt assembly (5) copy-dir to copy

Re: reading LZO compressed file in spark

2013-12-10 Thread Rajeev Srivastava
Very little. The only thing I could find was an "info" blog on Hadoop + Spark as used at Twitter. It does not contain the details though. A small LZO compressed file (5MB) with an index file works with my code, so I know my code must be working fine, but for larger LZO files the system chokes trying to uncompre

Re: Incremental Updates to an RDD

2013-12-10 Thread Wes Mitchell
So, does that mean that if I want to do a sliding window, then I have to, in some fashion, build a stream from the RDD, push a new value on the head, filter out the oldest value, and re-persist as an RDD? On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen wrote: > Kyle, the fundamental contr
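For what it's worth, that push/filter/re-persist shape could look roughly like the sketch below; the (timestamp, value) pair layout, the windowMs parameter, and the helper name are all illustrative assumptions rather than anything from the thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Slide the window: append the fresh values, drop anything older than windowMs,
    // and re-persist so the next slide starts from the cached RDD.
    def slide(sc: SparkContext, current: RDD[(Long, Double)],
              fresh: Seq[(Long, Double)], now: Long, windowMs: Long): RDD[(Long, Double)] = {
      val next = current.union(sc.parallelize(fresh))
        .filter { case (ts, _) => now - ts <= windowMs }
      next.cache()
      next
    }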

Re: Hadoop RDD incorrect data

2013-12-10 Thread Matt Cheah
It shouldn't be master only – the data is distributed in HDFS and I'm just invoking sequenceFile() to get the file, map() to copy the data so objects aren't re-used, keyBy() (JavaRDD) followed by sortByKey. In something like Java-scala-ish-pseudo-code: System.setProperty("spark.default.parallel
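In rough Scala form, the pipeline being described is something like this (the key/value types, path, and parallelism value are assumptions for illustration):

    import org.apache.hadoop.io.{LongWritable, Text}

    // Set before the SparkContext is created.
    System.setProperty("spark.default.parallelism", "64")   // example value only

    val raw = sc.sequenceFile("hdfs:///path/to/data", classOf[LongWritable], classOf[Text])
    // Copy out of Hadoop's reused Writable instances before shuffling or caching.
    val copied = raw.map { case (k, v) => (k.get, v.toString) }
    val sorted = copied.keyBy { case (_, value) => value }.sortByKey()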

Re: Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Thanks Mark, that cleared things up for me. I applied the cache() before the count() and now its behaving as expected. I really appreciate the fast response. Yadid On 12/10/13 12:20 PM, Mark Hamstra wrote: You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) unti

Re: Spark map performance question

2013-12-10 Thread Mark Hamstra
You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit rdd2.count, rdd1 is not yet cached (no action has been performed on it since it was marked as cached) and has to be completely re-evaluated. On the other hand, by the time you h
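Concretely, the fix Yadid describes above (calling cache() before the first count()) is just a reordering of the original snippet; here is a sketch with the expensive work stubbed out:

    val data = sc.parallelize(1 to 1000000)
    val rdd1 = data.map(x => (x % 100, x))    // stand-in for the keyBy/func1 step
    rdd1.cache()    // mark for caching *before* the first action
    rdd1.count()    // computes rdd1 once and stores it in the cache
    val rdd2 = rdd1.map { case (k, v) => (k, v + 1) }
    rdd2.count()    // reuses the cached rdd1 instead of recomputing it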

Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Hi All, I'm trying to understand the performance results I'm getting for the following:
    rdd = sc.newAPIHadoopRDD( ... )
    rdd1 = rdd.keyBy( func1() )
    rdd1.count()
    rdd1.cache()
    rdd2 = rdd1.map(func2())
    rdd2.count()
    rdd3 = rdd2.map(func2())
    rdd3.count()
I would expect the 2 maps to be more or le

Re: Remote client shutdown error

2013-12-10 Thread Frank Austin Nothaft
This is resolved. There was an issue in my build chain where spark 0.8.0 was getting built into the working application, and then spark 0.7.3 was getting built into the second (failing) application. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Dec 9, 201

Re: groupBy() with really big groups fails

2013-12-10 Thread Mark Hamstra
In a shuffle/reduce, the portion of the intermediate results destined for each of the R reducers and produced by a task run on each of the N partitions of the RDD needs to be materialized and sent to that reducer. So, N tasks each producing and materializing R intermediate results implies N*R files
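To put illustrative numbers on that (made up for the sake of the arithmetic, not from the thread):

    // 1,000 input partitions grouped into 1,000 reduce partitions materializes
    // roughly 1,000 * 1,000 = 1,000,000 shuffle files across the cluster.
    val pairs = sc.parallelize(1 to 1000000, 1000).map(x => (x % 1000, x))  // N = 1,000 map partitions
    val grouped = pairs.groupByKey(1000)                                    // R = 1,000 reducers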

Re: reading LZO compressed file in spark

2013-12-10 Thread Andrew Ash
I'm interested in doing this too Rajeev. Did you make any progress? On Mon, Dec 9, 2013 at 1:57 PM, Rajeev Srivastava wrote: > Hello experts, > I would like to read a LZO splittable compressed file into spark. > I have followed available material on the web on working with LZO > compressed

Re: Why does sortByKey launch cluster job?

2013-12-10 Thread Josh Rosen
I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this ( https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called befor
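Stripped down to the idea, the suggested change has this shape (a sketch of the pattern only, not the actual Partitioner source):

    // Before: bounds computed eagerly at construction time, which launches a
    // sampling job as soon as the partitioner is built.
    //   val rangeBounds: Array[K] = computeBoundsBySampling()
    // After: the same computation behind a lazy val, so the sampling job only
    // runs when the partitioner is first asked to place a key.
    class RangePartitionerSketch[K](computeBoundsBySampling: () => Array[K]) {
      lazy val rangeBounds: Array[K] = computeBoundsBySampling()
    }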

Re: Why does sortByKey launch cluster job?

2013-12-10 Thread Ryan Prenger
Thanks for the responses! I agree that b seems like it would be better. I could imagine optimizations that could be made if a filter call came after the sortByKey that would make the initial partitioning sub-optimal. Plus this way, it's a pain to use in the REPL. Cheers, Ryan On Tue, Dec 10,

Warning about poor interactions between PySpark and numexpr

2013-12-10 Thread Michael Ronquest
Hi Everyone, I've recently run into some unpleasantness with PySpark when trying to use a pandas DataFrame *inside* a mapPartitions function. I've traced the error to numexpr (which pandas uses) and submitted a bug here: https://code.google.com/p/numexpr/issues/detail?id=1

Re: groupBy() with really big groups fails

2013-12-10 Thread Grega Kešpret
Hi Aaron, thanks for the explanation, I also find it very helpful. On Mon, Dec 9, 2013 at 9:28 PM, Aaron Davidson wrote: > If you have N map partitions and R reducers, we create N*R files on disk > across the cluster in order to do the group by. Do you mind giving a link or explaining why N*R

Re: Why does sortByKey launch cluster job?

2013-12-10 Thread Andrew Ash
Since sortByKey() invokes those right now, we should either a) change the documentation to note that it kicks off actions, or b) change the method to execute those things lazily. Personally I'd prefer b but don't know how difficult that would be. On Tue, Dec 10, 2013 at 1:52 AM, Jason Lende

Starting with spark

2013-12-10 Thread Ravi Hemnani
Hey, I am trying to run *sudo ./run-example org.apache.spark.examples.JavaWordCount* and it keeps throwing an error: ./run-example: line 42: [: /opt/spark/spark-0.8.0-incubating/examples/target/scala-2.9.3/spark-examples_2.9.3-assembly-0.8.0-incubating.jar: binary operator expected Except