Forcing RDD computation with something else than count() ?

2014-01-21 Thread Guillaume Pitel
Hi, I'm struggling a bit with something: I have several datasets RDD[((Int,Int),Double)] that I want to merge. I've tried union+reduceByKey and cogroup+mapValues, but in all cases it seems that if I don't force the computation of the RDD, the final
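
For reference, a minimal sketch of the union + reduceByKey merge described above, assuming an existing SparkContext named sc and made-up sample data; the last line shows one way to force evaluation:

    import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey

    // Hypothetical inputs of type RDD[((Int, Int), Double)], as in the question
    val a = sc.parallelize(Seq(((1, 2), 0.5), ((3, 4), 1.5)))
    val b = sc.parallelize(Seq(((1, 2), 2.0), ((5, 6), 3.0)))

    // Merge by summing values that share the same (Int, Int) key; nothing runs yet
    val merged = a.union(b).reduceByKey(_ + _)

    // An action forces evaluation of the whole RDD
    merged.foreach(_ => ())   // or: merged.count()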

Re: Print in JavaNetworkWordCount

2014-01-21 Thread Eduardo Costa Alfaia
Thanks again Tathagata for your help. Best Regards On Jan 20, 2014, at 19:11, Tathagata Das tathagata.das1...@gmail.com wrote: Hi Eduardo, You can do arbitrary stuff with the data in a DStream using the operation foreachRDD. yourDStream.foreachRDD(rdd => { // Get and print first n
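
The quoted snippet is cut off by the archive; a minimal sketch of the foreachRDD pattern Tathagata describes, with the DStream name and n assumed for illustration:

    val n = 10
    yourDStream.foreachRDD { rdd =>
      // take(n) pulls at most the first n records of this batch back to the driver
      rdd.take(n).foreach(println)
    }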

Re: Quality of documentation (rant)

2014-01-21 Thread Ognen Duzlevski
On Mon, Jan 20, 2014 at 11:05 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote: Thanks. I will try that, but your assumption is that something is failing in an obvious way with a message. By the look of the spark-shell - just frozen - I would say something is stuck. Will report back. Given

How to clean up jars on worker nodes

2014-01-21 Thread Mingyu Kim
Hi all, I'd like the added jars on worker nodes (i.e. SparkContext.addJar()) to be cleaned up on tear down. However, SparkContext.stop() doesn't seem to delete them. What would be the best way to clear them? Or, is there an easy way to add this functionality? Mingyu

Re: Forcing RDD computation with something else than count() ?

2014-01-21 Thread madhu phatak
Hi, You can call less expensive operations like first or take to trigger the computation. On Tue, Jan 21, 2014 at 2:32 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi, I'm struggling a bit with something : I have several datasets RDD[((Int,Int),Double)] that I want to merge.

Re: Spark Master on Hadoop Job Tracker?

2014-01-21 Thread mharwida
Many thanks for the replies. The way I currently have my setup is as follows: 6 nodes running Hadoop, with each node holding approximately 5GB of data. I launched a Spark Master (and Shark via ./shark) on one of the Hadoop nodes and launched 5 Spark worker nodes on the remaining 5 Hadoop nodes. So

How to stop a streaming job

2014-01-21 Thread prabeesh k
How do I stop a streaming job using scc.stop? How do I pass it into the running code? I want to schedule a streaming job (by specifying start and stop times) based on user input. If the user wants to stop the streaming job, a request should trigger scc.stop on the running job.
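
Not from the thread, but one hedged sketch of scheduling a stop: call stop() on the StreamingContext (named ssc here, the conventional name for the scc above) from a separate thread once the user-specified run time has elapsed; the timing values are assumptions:

    import org.apache.spark.streaming.StreamingContext

    // Illustrative only: ssc is a StreamingContext built elsewhere,
    // runForMillis is derived from the user-supplied start/stop times
    def stopAfter(ssc: StreamingContext, runForMillis: Long): Unit = {
      new Thread(new Runnable {
        def run(): Unit = {
          Thread.sleep(runForMillis)
          ssc.stop()   // stops the running streaming computation
        }
      }).start()
    }

    // ssc.start()
    // stopAfter(ssc, 60 * 60 * 1000L)   // e.g. stop after one hour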

Re: Forcing RDD computation with something else than count() ?

2014-01-21 Thread Guillaume Pitel
Thanks. So you mean that first() triggers the computation of the WHOLE RDD? That does not sound right, I thought it was lazy. Guillaume Hi, You can call less expensive operations like first or take to trigger the computation.
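
For what it is worth, first() and take(n) are actions, but they only evaluate as many partitions as needed to return a result, so they generally do not materialize the whole RDD; a full-scan action does. A small sketch with a generic rdd:

    rdd.take(1)          // may leave most partitions uncomputed
    rdd.first()          // same: stops once it has one element

    rdd.count()          // scans every partition
    rdd.foreach(_ => ()) // also scans everything, without collecting results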

Spark on private network

2014-01-21 Thread goi cto
Hi, I am new to Spark and starting to set up my development environment. The first thing I noticed is that I need to be connected to the internet while compiling, *is that right?* *How can I set up my development environment on a closed network?* Thanks, -- Eran | CTO

Re: Spark on private network

2014-01-21 Thread Ognen Duzlevski
What I did on the VPC in Amazon is allow all outgoing traffic from it to the world and allow all traffic to flow freely on the ingress WITHIN my subnet (so for example, if your subnet is 10.10.0.0/24, you allow all machines on the network to connect to each other on any port but that's about all

Re: Want some Metrics

2014-01-21 Thread Pankaj Mittal
Hi, We enabled JmxSink in metrics.properties, but we only see basic metrics like memory usage. How can we introduce metrics at the application level, like throughput (messages processed/sec)? We tried to create a MetricRegistry and increment a counter in a map function, but that doesn't help. Or is
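
The counter incremented inside map() lives on the executors, which is why the MetricRegistry approach shows nothing useful on the driver. One alternative that works for simple counts (a sketch, not something suggested in the thread; input and transform are placeholders) is a Spark accumulator:

    // Accumulators are updated in tasks and merged back on the driver
    val processed = sc.accumulator(0)

    val results = input.map { msg =>
      processed += 1       // per-message bookkeeping inside the task
      transform(msg)       // placeholder for the real per-message work
    }

    results.count()        // an action must run before the value is meaningful
    println("messages processed: " + processed.value)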

Re: Lzo + Protobuf

2014-01-21 Thread Nick Pentreath
Hi Vipul Have you tried looking at the HBase and Cassandra examples under the Spark examples project? These use custom InputFormats and may provide guidance on how to go about using the relevant Protobuf InputFormat. On Mon, Jan 20, 2014 at 11:48 PM, Vipul Pandey vipan...@gmail.com wrote:

Re: Lzo + Protobuf

2014-01-21 Thread Issac Buenrostro
Hi Vipul, I use something like this to read from LZO compressed text files, it may be helpful: import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapreduce.Job val sc = new SparkContext(sparkMaster,
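
The code above is truncated by the archive; a hedged sketch of the same approach, reading LZO-compressed text through elephant-bird's LzoTextInputFormat with newAPIHadoopFile (master URL and path are placeholders, and the LZO native libraries still need to be available on the workers):

    import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "lzo-read")

    // Records come back as (byte offset, line); keep just the line text
    val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
        "hdfs:///path/to/file.lzo")
      .map { case (_, text) => text.toString }

    println(lines.count())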

Re: Spark on google compute engine

2014-01-21 Thread Andre Schumacher
Hi, Sorry for the late reply, this kind of slipped under my radar. On 01/13/2014 12:12 PM, Aureliano Buendia wrote: On Mon, Jan 13, 2014 at 5:59 PM, Josh Rosen rosenvi...@gmail.com wrote: If you'd like to use Spark with Docker, the AMPLab's Docker scripts might be a nice starting point:

Re: How to perform multi dimensional reduction in spark?

2014-01-21 Thread Aureliano Buendia
Surprisingly, this turned out to be more complicated than I expected. I had the impression that this would be trivial in Spark. Am I missing something here? On Tue, Jan 21, 2014 at 5:42 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, It seems Spark does not support nested RDDs,
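
Since the original message is truncated, the exact use case is unclear, but the usual workaround for a nested reduction is to fold the extra dimension into the key and use reduceByKey; a minimal sketch with made-up (row, col) data and an assumed SparkContext sc:

    import org.apache.spark.SparkContext._   // pair-RDD operations

    // Hypothetical values keyed by two dimensions: ((row, col), value)
    val cells = sc.parallelize(Seq(((0, 0), 1.0), ((0, 1), 2.0), ((0, 0), 3.0)))

    // Reduce over both dimensions at once...
    val byCell = cells.reduceByKey(_ + _)

    // ...or collapse one dimension by re-keying first
    val byRow = cells.map { case ((row, _), v) => (row, v) }.reduceByKey(_ + _)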

Lazy evaluation of RDD data transformation

2014-01-21 Thread DB Tsai
Hi, When the data is read from HDFS using textFile, a map function is then applied, as in the following code, to get the format right before feeding it into MLlib training algorithms. rddFile = sc.textFile(Some file on HDFS) rddData = rddFile.map(line => { val temp =
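
The map above is cut off by the archive; a hedged sketch of that kind of transformation (the comma-separated label/features layout and the path are assumptions), with sc an existing SparkContext:

    val rddFile = sc.textFile("hdfs:///some/file")
    val rddData = rddFile.map { line =>
      val temp = line.split(",").map(_.toDouble)
      (temp.head, temp.tail)   // (label, features) ready for a training algorithm
    }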

Re: Lazy evaluation of RDD data transformation

2014-01-21 Thread Matei Zaharia
If you don’t cache the RDD, the computation will happen over and over each time we scan through it. This is done to save memory in that case and because Spark can’t know at the beginning whether you plan to access a dataset multiple times. If you’d like to prevent this, use cache(), or maybe
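
A minimal illustration of Matei's point, reusing the hypothetical rddData from the previous message:

    // Without cache(), each action re-reads the file and re-runs the map
    rddData.count()
    rddData.count()   // recomputed from scratch

    // cache() keeps the parsed records in memory after the first evaluation
    val cached = rddData.cache()
    cached.count()    // computes and caches
    cached.count()    // served from memory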

Re: Lazy evaluation of RDD data transformation

2014-01-21 Thread DB Tsai
Hi Matei, It does make sense that the computation will happen over and over each time. If I understand correctly, do you mean that it will only compute the transformation of one particular line and then discard it to save memory? Or will it create the full result of the map operation, and

Re: OOM - Help Optimizing Local Job

2014-01-21 Thread Brad Ruderman
Hi Tathagata- Thanks for your help. I appreciate your suggestions. I am having trouble following the code for mapPartitions. According to the documentation I need to pass a function of type Iterator[T], thus I receive an error on the foreach. var textFile = sc.textFile("input.txt")
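
mapPartitions does expect a function of type Iterator[T] => Iterator[U], so the body must hand back an iterator rather than just running a foreach. A hedged sketch of the usual shape (the per-line work is a placeholder, not Brad's actual code):

    val textFile = sc.textFile("input.txt")

    val lengths = textFile.mapPartitions { lines: Iterator[String] =>
      val out = collection.mutable.ListBuffer[Int]()
      lines.foreach { line =>
        out += line.length   // placeholder per-line work
      }
      out.iterator           // the function must return an Iterator
    }

    lengths.count()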

Re: Spark streaming on YARN?

2014-01-21 Thread Mike Percy
Just FYI I got it working in standalone mode with no probs, so thanks for the help Tathagata. I was not able to get it working on YARN, I gave up. Thanks, Mike On Fri, Jan 10, 2014 at 6:17 PM, Tathagata Das tathagata.das1...@gmail.comwrote: Let me know if you have any trouble getting it

spark.default.parallelism

2014-01-21 Thread Ognen Duzlevski
This is what docs/configuration.md says about the property: Default number of tasks to use for distributed shuffle operations (groupByKey, reduceByKey, etc) when not set by user. If I set this property to, let's say, 4 - what does this mean? 4 tasks per core, per worker,
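
For reference, a minimal sketch of setting the property in the 0.8.x line, where configuration is passed as Java system properties before the SparkContext is created (master URL and app name are placeholders); Andrew's reply below addresses what the 4 actually means:

    System.setProperty("spark.default.parallelism", "4")

    val sc = new org.apache.spark.SparkContext("spark://master:7077", "parallelism-demo")

    // Shuffle operations such as reduceByKey now default to 4 tasks
    // unless an explicit numPartitions argument is passed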

Re: spark.default.parallelism

2014-01-21 Thread Andrew Ash
Documentation suggestion: Default number of tasks to use *across the cluster* for distributed shuffle operations (groupByKey, reduceByKey, etc) when not set by user. Ognen would that have clarified for you? On Tue, Jan 21, 2014 at 3:35 PM, Matei Zaharia

Shark 0.8.1-rc0 not able to connect to master spark 0.8.1-incubating HADOOP=2.2.0

2014-01-21 Thread Andre Kuhnen
Hello I'm trying to run Shark 0.8.1-rc0 with Spark 0.8.1 but I have no luck. I use Hadoop 2.2.0. I built with SHARK_HADOOP_VERSION=2.2.0 sbt/sbt package, but I am not able to connect to the master. I have Hadoop 2.2.0 running perfectly and also Spark 0.8.1 standalone. The shell works perfectly,

Re: spark.default.parallelism

2014-01-21 Thread Ognen Duzlevski
On Tue, Jan 21, 2014 at 10:37 PM, Andrew Ash and...@andrewash.com wrote: Documentation suggestion: Default number of tasks to use *across the cluster* for distributed shuffle operations (groupByKey, reduceByKey, etc) when not set by user. Ognen would that have clarified

Re: spark.default.parallelism

2014-01-21 Thread Andrew Ash
https://github.com/apache/incubator-spark/pull/489 On Tue, Jan 21, 2014 at 3:41 PM, Ognen Duzlevski og...@plainvanillagames.com wrote: On Tue, Jan 21, 2014 at 10:37 PM, Andrew Ash and...@andrewash.com wrote: Documentation suggestion: Default number of tasks to use *across the cluster* for

Re: OOM - Help Optimizing Local Job

2014-01-21 Thread Brad Ruderman
Thanks Nick. I really appreciate your help. I understand the iterator and have modified the code to return the iterator. What I am noticing is that it returns: import collection.mutable.HashMap import collection.mutable.ListBuffer def getArray(line: String):List[Int] = { var a =
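
The definition of getArray is cut off above; assuming it maps one line to a List[Int], one hedged way to wire such a helper into mapPartitions is to flatMap over the partition's iterator, which yields a single flat Iterator[Int] instead of nested lists (the helper body here is a placeholder, not Brad's code):

    // Placeholder standing in for the truncated getArray above
    def getArray(line: String): List[Int] = line.split("\\s+").map(_.length).toList

    val counts = sc.textFile("input.txt").mapPartitions { lines =>
      lines.flatMap(getArray)   // Iterator[Int], built lazily per partition
    }

    counts.count()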

Re: reading LZO compressed file in spark

2014-01-21 Thread Vipul Pandey
Hi Rajeev, Did you get past this exception? Thanks, Vipul On Dec 26, 2013, at 12:48 PM, Rajeev Srivastava raj...@silverline-da.com wrote: Hi Andrew, Thanks for your example. I used your command and I get the following errors from the worker (missing codec from the worker, I guess). How do

Re: reading LZO compressed file in spark

2014-01-21 Thread Rajeev Srivastava
Hi Vipul, Andrew Ash suggested the answer, which I have yet to try; apparently his experiment worked for his LZO files. I don't think I will be able to try his suggestions before Feb. Do share if his solution works for you. Regards, Rajeev Rajeev Srivastava Silverline Design Inc 2118 Walsh ave,