Re: problems with standalone cluster

2013-12-10 Thread Cesar Arevalo
Not sure if this will help you or if you've already tried it. But, maybe setting the log levels to debug will give you more information. Hope this helps. -Cesar On Tue, Dec 10, 2013 at 8:40 PM, Umar Javed wrote: > any help regarding this?...thx > > > On Tue, Nov 19, 2013 at 6:13 PM, Umar Jave
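For anyone wanting to try this, one way to raise the log level (a minimal sketch, assuming the stock log4j setup Spark ships with; editing conf/log4j.properties on each node is the other option) is to do it from the driver before creating the SparkContext:

    import org.apache.log4j.{Level, Logger}

    // Turn on DEBUG for everything; expect very verbose output.
    Logger.getRootLogger.setLevel(Level.DEBUG)

    // Or limit it to Spark's own classes to keep the noise down.
    Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)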

I need some help

2013-12-10 Thread leosand...@gmail.com
I have deployed two Spark clusters. The first is a simple standalone cluster, which is working well (sbt/sbt assembly). But in the second cluster I built Spark against Hadoop 2.0.0-cdh4.2.1, and there seems to be a problem when I start the master! (SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt

Re: problems with standalone cluster

2013-12-10 Thread Umar Javed
any help regarding this?...thx On Tue, Nov 19, 2013 at 6:13 PM, Umar Javed wrote: > I have a scala script that I'm trying to run on a Spark standalone cluster > with just one worker (existing on the master node). But the application > just hangs. Here's the worker log output at the time of star

Re: Writing an RDD to Hive

2013-12-10 Thread Philip Ogren
I uncovered a fairly simple solution that I thought I would share for the curious. Hive provides a JDBC driver/client which can be used to execute Hive statements (in my case to drop and create table
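For the curious, the JDBC route looks roughly like the sketch below. The host, port, table name, and HiveServer1-era driver class are assumptions for illustration, not details from the original post; HiveServer2 would use org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL instead.

    import java.sql.DriverManager

    // Register the Hive JDBC driver, then issue plain HiveQL statements.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive://hive-host:10000/default", "", "")
    val stmt = conn.createStatement()
    stmt.execute("DROP TABLE IF EXISTS my_table")                    // hypothetical table
    stmt.execute("CREATE TABLE my_table (id INT, value STRING)")
    stmt.close()
    conn.close()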

Spark hangs on bad Mesos slave

2013-12-10 Thread Gary Malouf
Hi guys, For reference, we are on a master build of Spark from November 19 and Mesos 0.13. Periodically, we run into an issue where one of our Mesos slaves takes some tasks from a Spark query and, according to the Mesos UI, they are stuck in 'STAGING'. This ends up blocking the query from running

Re: Constant out of memory issues

2013-12-10 Thread Patrick Wendell
Spark probably needs more than 1GB of heap space to function correctly. What happens if you give the workers more memory? - Patrick On Tue, Dec 10, 2013 at 2:42 PM, learner1014 all wrote: > > Data is in hdfs, running 2 workers with 1 GB memory > datafile1 is ~9KB and datafile2 is ~216MB. Cant ge
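For reference, in Spark of this vintage the executor heap can be raised from the driver before the SparkContext is created (a sketch; 4g is just an example value, and SPARK_WORKER_MEMORY on the workers is the other relevant knob):

    import org.apache.spark.SparkContext

    // Must be set before the SparkContext is constructed.
    System.setProperty("spark.executor.memory", "4g")   // example value, not a recommendation
    val sc = new SparkContext("spark://master:7077", "MyApp")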

Constant out of memory issues

2013-12-10 Thread learner1014 all
Data is in HDFS, running 2 workers with 1 GB of memory. datafile1 is ~9KB and datafile2 is ~216MB. Can't get it to run at all... Tried various settings for the number of tasks, all the way from 2 to 1024. Has anyone else seen similar issues? import org.apache.spark.SparkContext import org.apache

Re: reading LZO compressed file in spark

2013-12-10 Thread Stephen Haberman
> System.setProperty("spark.io.compression.codec", > "com.hadoop.compression.lzo.LzopCodec") This spark.io.compression.codec is a completely different setting than the codecs that are used for reading/writing from HDFS. (It is for compressing Spark's internal/non-HDFS intermediate output.) > Hop
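On the HDFS side, reading a splittable LZO file is usually a matter of pointing newAPIHadoopFile at the hadoop-lzo input format rather than touching spark.io.compression.codec. A minimal sketch, assuming the twitter/hadoop-lzo jar and its native libraries are on the cluster classpath and that sc is the SparkContext:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // Records come back as (byte offset, line); the companion .lzo.index file,
    // if present, is what lets the file be split across tasks.
    val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
      "hdfs:///path/to/file.lzo")        // illustrative path
      .map(_._2.toString)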

Re: Incremental Updates to an RDD

2013-12-10 Thread Christopher Nguyen
Wes, it depends on what you mean by "sliding window" as related to "RDD": 1. Some operation over multiple rows of data within a single, large RDD, for which the operations are required to be temporally sequential. This may be the case where you're computing a running average over historic
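As an illustration of case 1, a running average might be sketched like this (assumptions: the data is (timestamp, value) pairs, and a window never straddles a partition boundary, which a real implementation would have to handle):

    // Hypothetical input; in practice this would come from HDFS or similar.
    val timeSeries = sc.parallelize(Seq((1L, 1.0), (2L, 2.0), (3L, 3.0), (4L, 4.0)))
    val window = 3
    val runningAvg = timeSeries
      .sortByKey()
      .mapPartitions { iter =>
        iter.sliding(window).map { w =>
          (w.last._1, w.map(_._2).sum / w.size)   // average of the last `window` values
        }
      }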

Running Spark jar on EC2

2013-12-10 Thread Jeff Higgens
I'm having trouble running my Spark program as a "fat jar" on EC2. This is the process I'm using: (1) spark-ec2 script to launch cluster (2) ssh to master, install sbt and git clone my project's source code (3) update source to reference correct master and jar (4) sbt assembly (5) copy-dir to copy

Re: reading LZO compressed file in spark

2013-12-10 Thread Rajeev Srivastava
Very little. The only thing I could find was an "info" blog on Hadoop + Spark as used at Twitter. It does not contain the details though. A small LZO compressed file (5MB) with an index file works with my code, so I know my code must be working fine, but for larger LZO files the system chokes trying to uncompre

Re: Incremental Updates to an RDD

2013-12-10 Thread Wes Mitchell
So, does that mean that if I want to do a sliding window, then I have to, in some fashion, build a stream from the RDD, push a new value on the head, filter out the oldest value, and re-persist as an RDD? On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen wrote: > Kyle, the fundamental contr
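For what it's worth, that push/filter/re-persist shape could look roughly like the sketch below; the (timestamp, value) pair layout, the windowMs parameter, and the helper name are all illustrative assumptions rather than anything from the thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Slide the window: append the fresh values, drop anything older than windowMs,
    // and re-persist so the next slide starts from the cached RDD.
    def slide(sc: SparkContext, current: RDD[(Long, Double)],
              fresh: Seq[(Long, Double)], now: Long, windowMs: Long): RDD[(Long, Double)] = {
      val next = current.union(sc.parallelize(fresh))
        .filter { case (ts, _) => now - ts <= windowMs }
      next.cache()
      next
    }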

Re: Hadoop RDD incorrect data

2013-12-10 Thread Matt Cheah
It shouldn't be master only – the data is distributed in HDFS and I'm just invoking sequenceFile() to get the file, map() to copy the data so objects aren't re-used, keyBy() (JavaRDD) followed by sortByKey. In something like Java-scala-ish-pseudo-code: System.setProperty("spark.default.parallel
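In rough Scala form, the pipeline being described is something like this (the key/value types, path, and parallelism value are assumptions for illustration):

    import org.apache.hadoop.io.{LongWritable, Text}

    // Set before the SparkContext is created.
    System.setProperty("spark.default.parallelism", "64")   // example value only

    val raw = sc.sequenceFile("hdfs:///path/to/data", classOf[LongWritable], classOf[Text])
    // Copy out of Hadoop's reused Writable instances before shuffling or caching.
    val copied = raw.map { case (k, v) => (k.get, v.toString) }
    val sorted = copied.keyBy { case (_, value) => value }.sortByKey()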

Re: Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Thanks Mark, that cleared things up for me. I applied the cache() before the count() and now its behaving as expected. I really appreciate the fast response. Yadid On 12/10/13 12:20 PM, Mark Hamstra wrote: You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) unti

Re: Spark map performance question

2013-12-10 Thread Mark Hamstra
You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit rdd2.count, rdd1 is not yet cached (no action has been performed on it since it was marked as cached) and has to be completely re-evaluated. On the other hand, by the time you h
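Concretely, the fix Yadid describes above (calling cache() before the first count()) is just a reordering of the original snippet; here is a sketch with the expensive work stubbed out:

    val data = sc.parallelize(1 to 1000000)
    val rdd1 = data.map(x => (x % 100, x))    // stand-in for the keyBy/func1 step
    rdd1.cache()    // mark for caching *before* the first action
    rdd1.count()    // computes rdd1 once and stores it in the cache
    val rdd2 = rdd1.map { case (k, v) => (k, v + 1) }
    rdd2.count()    // reuses the cached rdd1 instead of recomputing it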

Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Hi All, I'm trying to understand the performance results I'm getting for the following:
    rdd = sc.newAPIHadoopRDD( ... )
    rdd1 = rdd.keyBy( func1() )
    rdd1.count()
    rdd1.cache()
    rdd2 = rdd1.map(func2())
    rdd2.count()
    rdd3 = rdd2.map(func2())
    rdd3.count()
I would expect the 2 maps to be more or le

Re: Remote client shutdown error

2013-12-10 Thread Frank Austin Nothaft
This is resolved. There was an issue in my build chain where spark 0.8.0 was getting built into the working application, and then spark 0.7.3 was getting built into the second (failing) application. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Dec 9, 201

Re: groupBy() with really big groups fails

2013-12-10 Thread Mark Hamstra
In a shuffle/reduce, the portion of the intermediate results destined for each of the R reducers and produced by a task run on each of the N partitions of the RDD needs to be materialized and sent to that reducer. So, N tasks each producing and materializing R intermediate results implies N*R files
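To put illustrative numbers on that (made up for the sake of the arithmetic, not from the thread):

    // 1,000 input partitions grouped into 1,000 reduce partitions materializes
    // roughly 1,000 * 1,000 = 1,000,000 shuffle files across the cluster.
    val pairs = sc.parallelize(1 to 1000000, 1000).map(x => (x % 1000, x))  // N = 1,000 map partitions
    val grouped = pairs.groupByKey(1000)                                    // R = 1,000 reducers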

Re: reading LZO compressed file in spark

2013-12-10 Thread Andrew Ash
I'm interested in doing this too Rajeev. Did you make any progress? On Mon, Dec 9, 2013 at 1:57 PM, Rajeev Srivastava wrote: > Hello experts, > I would like to read a LZO splittable compressed file into spark. > I have followed available material on the web on working with LZO > compressed

Re: Why does sortByKey launch cluster job?

2013-12-10 Thread Josh Rosen
I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this ( https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called befor
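Stripped down to the idea, the suggested change has this shape (a sketch of the pattern only, not the actual Partitioner source):

    // Before: bounds computed eagerly at construction time, which launches a
    // sampling job as soon as the partitioner is built.
    //   val rangeBounds: Array[K] = computeBoundsBySampling()
    // After: the same computation behind a lazy val, so the sampling job only
    // runs when the partitioner is first asked to place a key.
    class RangePartitionerSketch[K](computeBoundsBySampling: () => Array[K]) {
      lazy val rangeBounds: Array[K] = computeBoundsBySampling()
    }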

Re: Why does sortByKey launch cluster job?

2013-12-10 Thread Ryan Prenger
Thanks for the responses! I agree that b seems like it would be better. I could imagine optimizations that could be made if a filter call came after the sortByKey that would make the initial partitioning sub-optimal. Plus this way, it's a pain to use in the REPL. Cheers, Ryan On Tue, Dec 10,

Warning about poor interactions between PySpark and numexpr

2013-12-10 Thread Michael Ronquest
Hi Everyone, I've recently run into some unpleasantness with PySpark when trying to use a pandas DataFrame *inside* a mapPartitions function. I've traced the error to numexpr (which pandas uses) and submitted a bug here: https://code.google.com/p/numexpr/issues/detail?id=1

Re: groupBy() with really big groups fails

2013-12-10 Thread Grega Kešpret
Hi Aaron, thanks for the explanation, I also find it very helpful. On Mon, Dec 9, 2013 at 9:28 PM, Aaron Davidson wrote: > If you have N map partitions and R reducers, we create N*R files on disk > across the cluster in order to do the group by. Do you mind giving a link or explaining why N*R

Re: Why does sortByKey launch cluster job?

2013-12-10 Thread Andrew Ash
Since sortByKey() invokes those right now, we should either a) change the documentation to note that it kicks off actions, or b) change the method to execute those things lazily. Personally I'd prefer b but don't know how difficult that would be. On Tue, Dec 10, 2013 at 1:52 AM, Jason Lende

Starting with spark

2013-12-10 Thread Ravi Hemnani
Hey, I am trying to run *sudo ./run-example org.apache.spark.examples.JavaWordCount* and it keeps throwing an error: ./run-example: line 42: [: /opt/spark/spark-0.8.0-incubating/examples/target/scala-2.9.3/spark-examples_2.9.3-assembly-0.8.0-incubating.jar: binary operator expected Except