Re: sparkR - is it possible to run sparkR on yarn?

2014-04-30 Thread Shivaram Venkataraman
We don't have any documentation on running SparkR on YARN and I think there might be some issues that need to be fixed (The recent PySpark on YARN PRs are an example). SparkR has only been tested to work with Spark standalone mode so far. Thanks Shivaram On Tue, Apr 29, 2014 at 7:56 PM,

Setting spark.locality.wait.node parameter in interactive shell

2014-04-30 Thread Sai Prasanna
Hi, any suggestions on the following issue? I have a replication factor of 3 in my HDFS. I ran my experiments with 3 datanodes. Now I just added another node with no data on it. When I ran again, Spark launched non-local tasks on it, and the time taken was more than it took on the 3-node cluster.
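
A minimal sketch of how these locality waits can be raised when building a SparkContext yourself (the values are placeholders; in the pre-built interactive shell the context already exists, so the settings would have to be supplied before the shell starts):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values: a longer locality wait makes Spark prefer waiting for a
    // data-local slot on the original datanodes over launching a non-local task
    // on the new, empty node.
    val conf = new SparkConf()
      .setAppName("LocalityWaitExample")
      .set("spark.locality.wait", "10000")      // base wait, in milliseconds
      .set("spark.locality.wait.node", "10000") // node-local wait, in milliseconds
    val sc = new SparkContext(conf)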

RE: How fast would you expect shuffle serialize to be?

2014-04-30 Thread Liu, Raymond
I just tried to use the serializer to write objects directly in local mode with this code: val datasize = args(1).toInt val dataset = (0 until datasize).map( i => (asmallstring, i)) val out: OutputStream = { new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100)
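
For reference, a self-contained sketch of timing raw serialization to disk along the same lines (this uses plain JDK serialization rather than Spark's pluggable serializer, so treat it only as a baseline; argument positions follow the snippet above):

    import java.io.{BufferedOutputStream, FileOutputStream, ObjectOutputStream}

    object SerializeTiming {
      def main(args: Array[String]): Unit = {
        val datasize = args(1).toInt
        val dataset  = (0 until datasize).map(i => ("asmallstring", i))

        // Buffered file output, as in the snippet above.
        val out = new ObjectOutputStream(
          new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100))

        val start = System.nanoTime()
        dataset.foreach(x => out.writeObject(x))
        out.close()
        println(s"Wrote $datasize records in ${(System.nanoTime() - start) / 1e6} ms")
      }
    }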

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
Yes, that’s what I meant. Sure, the numbers might not be actually sorted, but the order of rows is semantically kept throughout non-shuffling transforms. I’m on board with you on union as well. Back to the original question, then: why is it important to coalesce to a single partition? When you

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread wxhsdp
I fixed it. I made my sbt project depend on spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar and it works. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-from-Spark-Java-tp4937p5096.html Sent from the

Re: Shuffle Spill Issue

2014-04-30 Thread Daniel Darabos
Whoops, you are right. Sorry for the misinformation. Indeed reduceByKey just calls combineByKey: def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = { combineByKey[V]((v: V) => v, func, func, partitioner) } (I think I confused reduceByKey with groupByKey.) On Wed, Apr
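
A tiny illustration of the point, assuming it is run in the spark-shell (where `sc` already exists):

    // Needed outside the shell for the pair-RDD methods (Spark 0.9/1.0 era).
    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // reduceByKey delegates to combineByKey, so values are merged map-side.
    val summed = pairs.reduceByKey(_ + _)             // ("a", 3), ("b", 3)

    // groupByKey ships every value across the shuffle before combining.
    val grouped = pairs.groupByKey().mapValues(_.sum) // same result, more shuffle data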

the spark configuage

2014-04-30 Thread Sophia
Hi, when I configure Spark and run the shell command ./spark-shell, it tells me: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. Then, when it connects to the ResourceManager, it stops. What should I do? Awaiting your reply -- View this

Re: Joining not-pair RDDs in Spark

2014-04-30 Thread jsantos
That's the approach I finally used. Thanks for your help :-) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-not-pair-RDDs-in-Spark-tp5034p5099.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: the spark configuage

2014-04-30 Thread Akhil Das
Hi, the reason you saw that warning is that the native Hadoop library $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was actually compiled for 32-bit. Anyway, it's just a warning and won't impact Hadoop's functionality. If you do want to eliminate this warning, here is the way: download the source code

Re: the spark configuage

2014-04-30 Thread Andras Nemeth
On 30 Apr 2014 10:35, Akhil Das ak...@sigmoidanalytics.com wrote: Hi, the reason you saw that warning is that the native Hadoop library $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 was actually compiled for 32-bit. Anyway, it's just a warning and won't impact Hadoop's functionality. Here is the

new Washington DC Area Spark Meetup

2014-04-30 Thread Donna-M. Fernandez
Hi, all! For those in the Washington DC area (DC/MD/VA), we just started a new Spark Meetup. We'd love for you to join! -d Here's the link: http://www.meetup.com/Washington-DC-Area-Spark-Interactive/ Description: This is an interactive meetup for Washington DC, Virginia and Maryland users,

Re: the spark configuage

2014-04-30 Thread Diana Carroll
I'm guessing your shell stopping when it attempts to connect to the RM is not related to that warning. You'll get that message out of the box from Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0 installed in default locations, and got rid of the warning by setting

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
Okay, that makes sense. It’d be great if this can be better documented at some point, because the only way to find out about the resulting RDD row order is by looking at the code. Thanks for the discussion! Mingyu On 4/29/14, 11:59 PM, Patrick Wendell pwend...@gmail.com wrote: I don't think

Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Peter
Hi, playing around with Spark and S3, I'm opening multiple objects (CSV files) with: val hfile = sc.textFile("s3n://bucket/2014-04-28/") so hfile is an RDD representing 10 objects that were underneath 2014-04-28. After I've sorted and otherwise transformed the content, I'm trying to write it

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Peter
Ah, looks like RDD.coalesce(1) solves one part of the problem. On Wednesday, April 30, 2014 11:15 AM, Peter thenephili...@yahoo.com wrote: Hi, playing around with Spark and S3, I'm opening multiple objects (CSV files) with: val hfile = sc.textFile("s3n://bucket/2014-04-28/") so hfile is an RDD
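
A rough sketch of the overall flow, assuming it is run in the spark-shell; the bucket path is a placeholder and keying on the first CSV column is purely hypothetical:

    val hfile = sc.textFile("s3n://bucket/2014-04-28/")  // all objects under the prefix

    val sorted = hfile
      .map(line => (line.split(",")(0), line))  // hypothetical: key on the first CSV column
      .sortByKey()
      .values

    // Force everything into one partition so a single part file is written.
    sorted.coalesce(1).saveAsTextFile("s3n://bucket/2014-04-28-merged/")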

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
I agree with you in general that as an API user, I shouldn’t be relying on code. However, without looking at the code, there is no way for me to find out even whether map() keeps the row order. Without the knowledge at all, I’d need to do “sort” every time I need certain things in a certain order.

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Nicholas Chammas
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you coalesce(1), you move everything in the RDD to a single partition, which then gives you 1 output file. It will still be called part-0 or something like that because that’s defined by the Hadoop API that Spark uses for

Re: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Sean Owen
S is the previous count, if any. Seq[V] are potentially many new counts. All of them have to be added together to keep an accurate total. It's as if the count were 3, and I tell you I've just observed 2, 5, and 1 additional occurrences -- the new count is 3 + (2+5+1) not 1 + 1. I butted in
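
A minimal update function matching this description, with the DStream name assumed for illustration:

    // Previous running total (if any) plus all newly observed counts in this batch.
    val updateCount = (newValues: Seq[Int], runningTotal: Option[Int]) =>
      Some(runningTotal.getOrElse(0) + newValues.sum)

    // Assuming wordCounts is a DStream[(String, Int)]:
    // val totals = wordCounts.updateStateByKey(updateCount)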

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Peter
Thanks Nicholas, this is a bit of a shame; it's not very practical for log roll-up, for example, when every output needs to be in its own directory. On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yes, saveAsTextFile() will give you 1 part per RDD

Re: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Tathagata Das
Yeah, I remember changing fold to sum in a few places, probably in testsuites, but missed this example I guess. On Wed, Apr 30, 2014 at 1:29 PM, Sean Owen so...@cloudera.com wrote: S is the previous count, if any. Seq[V] are potentially many new counts. All of them have to be added together

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread Marcelo Vanzin
Hi, One thing you can do is set the spark version your project depends on to 1.0.0-SNAPSHOT (make sure it matches the version of Spark you're building); then before building your project, run sbt publishLocal on the Spark tree. On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wxh...@gmail.com wrote: i
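
A hypothetical build.sbt fragment for that approach, after running `sbt publishLocal` in the Spark source tree:

    // Resolved from the local Ivy repository populated by `sbt publishLocal`.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"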

[ANN]: Scala By The Bay Conference ( aka Silicon Valley Scala Symposium)

2014-04-30 Thread Chester Chen
Hi, this is not related to Spark, but I thought you might be interested: the second SF Scala conference is coming this August. The SF Scala conference was called the Silicon Valley Scala Symposium last year. From now on, it will be known as Scala By The Bay.

My talk on Spark: The Next Top (Compute) Model

2014-04-30 Thread Dean Wampler
I meant to post this last week, but this is a talk I gave at the Philly ETE conf. last week: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model Also here: http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf dean -- Dean Wampler, Ph.D. Typesafe

Re: Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-30 Thread buremba
Thanks for your reply. Sorry for the late response; I wanted to do some tests before writing back. The counting part works similarly to your advice: I specify a minimum interval like 1 minute, and for each hour, day, etc. it sums all counters of the current child intervals. However when I want to

update of RDDs

2014-04-30 Thread narayanabhatla NarasimhaMurthy
In our application, we need distributed RDDs containing key-value maps. We have operations that update RDDs by adding entries to the map, deleting entries from the map, as well as updating the value part of the maps. We also have map-reduce functions that operate on the RDDs. The questions are the
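
For what it's worth, RDDs themselves are immutable, so such "updates" are normally expressed as transformations that produce a new RDD. A minimal sketch, with placeholder keys and values and assuming a SparkContext `sc` as in the shell:

    // (Outside the shell, also: import org.apache.spark.SparkContext._ for pair-RDD methods.)
    val current = sc.parallelize(Seq(("k1", 1), ("k2", 2)))

    val afterAdd    = current.union(sc.parallelize(Seq(("k3", 3))))  // add entries
    val afterDelete = afterAdd.filter { case (k, _) => k != "k2" }   // delete entries
    val afterUpdate = afterDelete.mapValues(_ * 10)                  // update values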

Re: Strange lookup behavior. Possible bug?

2014-04-30 Thread Yadid Ayzenberg
Dear Sparkers, has anyone got any insight on this? I am really stuck. Yadid On 4/28/14, 11:28 AM, Yadid Ayzenberg wrote: Thanks for your answer. I tried running on a single machine - master and worker on one host. I get exactly the same results. Very little CPU activity on the machine in

CDH 5.0 and Spark 0.9.0

2014-04-30 Thread Paul Schooss
Hello, So I was unable to run the following commands from the spark shell with CDH 5.0 and Spark 0.9.0, see below. Once I removed the property <property> <name>io.compression.codec.lzo.class</name> <value>com.hadoop.compression.lzo.LzoCodec</value> <final>true</final> </property> from the core-site.xml on the

same partition id means same location?

2014-04-30 Thread wxhsdp
Hi, I'm just reviewing advanced Spark features; it's about the PageRank example. It said any shuffle operation on two RDDs will take on the partitioner of one of them, if one is set. So first we partition the Links by a HashPartitioner, then we join the Links and Ranks0. Ranks0 will take
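
A sketch of the setup being described, assuming a SparkContext `sc`; the data and the partition count are placeholders:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair-RDD methods, if outside the shell

    // Pre-partition the links once and cache them.
    val links = sc.parallelize(Seq(("a", Seq("b", "c")), ("b", Seq("a"))))
      .partitionBy(new HashPartitioner(4))
      .cache()

    // mapValues preserves the partitioner, so the ranks start out co-partitioned.
    var ranks = links.mapValues(_ => 1.0)

    // The join takes on links' HashPartitioner, so the large links RDD is not reshuffled.
    val contribs = links.join(ranks).flatMap { case (_, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)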

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Patrick Wendell
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file because it will always produce the same filename. (heavy use of pseudo code) dir = /some/dir rdd.coalesce(1).saveAsTextFile(dir) f = new File(dir + part-0)
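
A runnable version of that pseudo code (paths are placeholders, `rdd` is assumed to exist, and the exact part-file name is an assumption):

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val dir = "/some/dir"
    rdd.coalesce(1).saveAsTextFile(dir)

    // Rename the single part file (commonly "part-00000") to a fixed name.
    val fs = FileSystem.get(new URI(dir), sc.hadoopConfiguration)
    fs.rename(new Path(dir + "/part-00000"), new Path(dir + "/output.txt"))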

How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-04-30 Thread PengWeiPRC
Hi there, I was wondering if somebody could give me some suggestions about how to handle this situation: I have a Spark program that first reads a 6GB file locally (not an RDD) and then does the map/reduce tasks. This 6GB file contains information that will be shared by all the map tasks.
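
One common approach (not suggested in this thread itself) is a broadcast variable, assuming the shared data can be held in memory on the driver and executors; the file paths and record format below are placeholders:

    // Load the shared data once on the driver...
    val shared: Map[String, String] =
      scala.io.Source.fromFile("/path/to/shared-file").getLines()
        .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }  // hypothetical format
        .toMap

    // ...and broadcast it so each executor keeps a single read-only copy.
    val sharedBc = sc.broadcast(shared)

    val result = sc.textFile("hdfs:///input").map { record =>
      sharedBc.value.getOrElse(record, "missing")  // every task reads the local copy
    }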