Re: Get RDD partition location

2014-02-22 Thread Tathagata Das
I don't think there is a clean way to do that. It's best to create separate RDDs yourself. TD On Sat, Feb 22, 2014 at 12:11 PM, Grega Kešpret gr...@celtra.com wrote: Is it possible to get the location (e.g. file name) of an RDD partition? Let's say I do val logs = sc.textFile(s3n://some/path/*/*)

Re: How to sort an RDD ?

2014-02-22 Thread Tathagata Das
at 4:48 PM, Fabrizio Milo aka misto mistob...@gmail.com wrote: I am using the latest from github compiled locally On Sat, Feb 22, 2014 at 3:22 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Which version of Spark are you using? TD On Sat, Feb 22, 2014 at 3:15 PM, Fabrizio

Re: 1 day window size

2014-02-21 Thread Tathagata Das
1. I don't think we have tested window sizes that long. 2. If you have to keep track of a day's worth of data, it may be better to use an external system that is more dedicated to lookups over massive amounts of data (say, Cassandra). Use some unique key to push all the data to Cassandra and

Re: Getting started using spark for computer vision and video analytics

2014-02-21 Thread Tathagata Das
1. Yes, Spark Streaming can process many datastreams in parallel, especially if all the datastreams are of the same type and all the streams get merged and processed together through the same operations. You should be able to use
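
A minimal sketch of the merge-and-process pattern described above (ssc is an assumed StreamingContext; the hostnames and port are purely hypothetical):

    // Create several socket text streams and merge them so that the same
    // operations are applied to all of them together.
    val streams = (1 to 3).map(i => ssc.socketTextStream("stream-host-" + i, 9999))
    val merged = ssc.union(streams)
    merged.count().print()   // records per batch across all merged streams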

Re: Using PySpark for Streaming

2014-02-21 Thread Tathagata Das
As Jeremy said, Spark Streaming has no Python API yet. However, there are a number of things you can do that allow you to do your main data manipulation in Python. The Spark API allows the data of a dataset to be piped out to any arbitrary external script (say, a Bash script or a Python script).
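
A rough sketch of that pipe-based approach (process.py is a hypothetical script that reads records on stdin and writes results to stdout; dstream is an assumed DStream[String]):

    // Pipe each batch of the DStream through an external Python script.
    dstream.foreachRDD { rdd =>
      val piped = rdd.pipe("python process.py")   // one external process per partition
      println("Records returned by the script: " + piped.count())
    }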

Re: Spark Streaming windowing Driven by absolutely time?

2014-02-20 Thread Tathagata Das
The reason we chose to define windows based on time is because of the underlying system design of Spark Streaming. Spark Streaming essentially divides received data into batches of a fixed time interval and then runs a Spark job on that data. So the system naturally maintains a mapping of time interval to

Re: Mutating RDD

2014-02-20 Thread Tathagata Das
To add to the discussion, Spark Streaming's text file stream automatically detects new files and generates RDDs out of them. For example, if you run 10-second batches, then all new files (of the same format) generated in the directory in each interval will be read and made into per-interval RDDs.
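
A minimal sketch of that file-stream pattern, assuming a 10-second batch interval and an illustrative monitored directory:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[2]", "FileStreamExample", Seconds(10))
    // Every file that appears in this (hypothetical) directory after the context
    // starts becomes part of that interval's RDD.
    val lines = ssc.textFileStream("hdfs:///incoming/logs")
    lines.count().print()    // number of new lines read in each 10-second batch
    ssc.start()
    ssc.awaitTermination()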

Re: How to use FlumeInputDStream in spark cluster?

2014-02-20 Thread Tathagata Das
This is a little confusing. Let's try to confirm the following first. In the Spark application's web UI, can you find the stage (one of the first few) that has only 1 task and has the name XYZ at NetworkInputTracker? In that, can you see where the single task is running? Is it on node-005, or

Re: Explain About Logs NetworkWordcount.scala

2014-02-20 Thread Tathagata Das
It could be that the worker receiving the data was undergoing GC and so could not actually receive any data. Can you check the web UI for the application to see the GC times of the corresponding stages? TD On Thu, Feb 20, 2014 at 12:03 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: is fresh data

Re: Q: Discretized Streams: Fault-Tolerant Streaming Computation paper

2014-02-20 Thread Tathagata Das
, 2014 at 3:45 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I have a question on the following paper Discretized Streams: Fault-Tolerant Streaming Computation at Scale written by Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica and available

Re: checkpoint and not running out of disk space

2014-02-18 Thread Tathagata Das
in this case bc there is no replication? Won't, in this case, Spark use the power of its RDDs? Thanks again A *From:* Tathagata Das [mailto:tathagata.das1...@gmail.com] *Sent:* February-14-14 8:15 PM *To:* user@spark.incubator.apache.org *Subject:* Re: checkpoint and not running out

Re: reduceByKey() is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2014-02-18 Thread Tathagata Das
You have to import StreamingContext._, which brings in the implicit conversions that allow reduceByKey() to be applied on DStreams of key-value pairs. TD On Tue, Feb 18, 2014 at 12:03 PM, bethesda swearinge...@mac.com wrote: I am getting this error when trying to code from the following page in
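
For illustration, a sketch of the import in context (lines is an assumed DStream[String]):

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits

    val words = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // compiles only with the import above
    counts.print()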

Re: Spark Streaming on a cluster

2014-02-18 Thread Tathagata Das
The default ZeroMQ receiver that comes with the Spark repository does not guarantee on which machine the ZeroMQ receiver will be launched; it can be on any of the worker machines in the cluster, NOT the application machine (called the driver in our terms). And your understanding of the code and the

Re: How to use FlumeInputDStream in spark cluster?

2014-02-18 Thread Tathagata Das
It could be that the hostname that Spark uses to identify the node is different from the one you are providing. Are you using the Spark standalone mode? In that case, you can check out the hostnames that Spark is seeing and use that name. Let me know if that works out. TD On Mon, Feb 17, 2014

Re: checkpoint and not running out of disk space

2014-02-18 Thread Tathagata Das
checkpointing? how? On Tue, Feb 18, 2014 at 5:44 PM, Tathagata Das tathagata.das1...@gmail.com wrote: A3: The basic RDD model is that the dataset is immutable. As new batches of data come in, each batch is treated as an RDD. Then RDD transformations are applied to create new RDDs. When some

Re: Spark streaming questions

2014-02-14 Thread Tathagata Das
duration, but is it possible to set this to let's say an hour without changing slide duration ? Thanks Pankaj On Fri, Feb 14, 2014 at 7:50 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Answers inline. Hope these answer your questions. TD On Thu, Feb 13, 2014 at 5:49 PM, Sourav

Re: checkpoint and not running out of disk space

2014-02-14 Thread Tathagata Das
Hello Adrian, A1: There is a significant difference between cache and checkpoint. Cache materializes the RDD and keeps it in memory and/or on disk. But the lineage of the RDD (that is, the sequence of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the
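
A small sketch contrasting the two calls (sc is an assumed SparkContext; the paths are illustrative):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical checkpoint directory

    val data = sc.textFile("hdfs:///input/data").map(_.length)
    data.cache()        // materialize in memory and/or disk; lineage is still remembered
    data.checkpoint()   // write to the checkpoint directory and truncate the lineage
    data.count()        // the action that actually triggers caching and checkpointing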

Re: Cluster launch

2014-02-14 Thread Tathagata Das
regards, - Guanhua From: Tathagata Das tathagata.das1...@gmail.com Reply-To: user@spark.incubator.apache.org Date: Thu, 13 Feb 2014 17:39:53 -0800 To: user@spark.incubator.apache.org Subject: Re: Cluster launch I am not entirely sure if that was the intended configuration for the scripts

Re: Errors occurred while compiling module 'spark-streaming-zeromq' (IntelliJ IDEA 13.0.2)

2014-02-13 Thread Tathagata Das
If you are using an sbt project file to link to Spark Streaming, then it is actually simpler. Here is an example sbt file that links to Spark Streaming and Spark Streaming's Twitter functionality (for Spark 0.9): https://github.com/amplab/training/blob/ampcamp4/streaming/scala/build.sbt Instead of
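
A sketch of the relevant dependency section of such a build.sbt for Spark 0.9 (the exact module list depends on which external connectors are needed):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "0.9.0-incubating",
      "org.apache.spark" %% "spark-streaming"         % "0.9.0-incubating",
      "org.apache.spark" %% "spark-streaming-twitter" % "0.9.0-incubating"
    )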

Re: running Spark Streaming just once and stop it

2014-02-13 Thread Tathagata Das
You could do a couple of things. 1. You can explicitly call streamingContext.stop() when the first iteration is over. To detect whether the first iteration is over, you can use the
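
One possible sketch of option 1 (the detection logic is an assumption; here a flag is set from foreachRDD once the first non-empty batch has been processed, and the driver polls it):

    import java.util.concurrent.atomic.AtomicBoolean

    // Assumes ssc is a StreamingContext and stream is a DStream over the input.
    val firstBatchDone = new AtomicBoolean(false)
    stream.foreachRDD { rdd =>
      if (rdd.count() > 0) firstBatchDone.set(true)   // first non-empty batch processed
    }
    ssc.start()
    while (!firstBatchDone.get()) Thread.sleep(500)   // poll from the driver thread
    ssc.stop()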

Re: spark streaming job - set java memory

2014-02-13 Thread Tathagata Das
It should work when the property is set BEFORE creating the StreamingContext. Or if you are explicitly creating a SparkContext and then creating a StreamingContext from that SparkContext, then the configuration must be set BEFORE the SparkContext is created. With 0.9, you can also use the SparkConf
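
A sketch of the SparkConf route mentioned above for 0.9 (master URL, app name, and memory value are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("StreamingJob")
      .set("spark.executor.memory", "4g")   // set before any context is created
    val ssc = new StreamingContext(conf, Seconds(1))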

Re: Connecting App to cluster VS Launching app within cluster

2014-02-13 Thread Tathagata Das
Launching your application in a cluster may be useful in a number of scenarios. 1) In a number of settings in companies, users who want to run jobs do not have SSH access to any of the cluster nodes. So they have to run the Spark driver program on their local machine and connect to the Spark

Re: Cluster launch

2014-02-13 Thread Tathagata Das
You could use sbin/start-slave.sh on the slave machine to launch the slave. That should use the local SPARK_HOME on the slave machine to launch the worker correctly. TD On Thu, Feb 13, 2014 at 1:09 PM, Guanhua Yan gh...@lanl.gov wrote: Hi all: I was trying to run sbin/start-master.sh and

Re: Spark streaming questions

2014-02-13 Thread Tathagata Das
Answers inline. Hope these answer your questions. TD On Thu, Feb 13, 2014 at 5:49 PM, Sourav Chandra sourav.chan...@livestream.com wrote: HI, I have couple of questions: 1. While going through the spark-streaming code, I found out there is one configuration in JobScheduler/Generator

Re: Spark streaming questions

2014-02-13 Thread Tathagata Das
On Fri, Feb 14, 2014 at 7:50 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Answers inline. Hope these answer your questions. TD On Thu, Feb 13, 2014 at 5:49 PM, Sourav Chandra sourav.chan...@livestream.com wrote: HI, I have couple of questions: 1. While going through

Re: cannot find awaitTermination()

2014-02-10 Thread Tathagata Das
awaitTermination() was added in Spark 0.9. Are you trying to run the HdfsWordCount example, maybe in your own separate project? Make sure you are compiling with Spark 0.9 and not anything older. TD On Mon, Feb 10, 2014 at 6:50 AM, Kal El pinu.datri...@yahoo.com wrote: I am trying to run a
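
For reference, a minimal sketch of how the call is used (ssc is an assumed StreamingContext built against Spark 0.9):

    ssc.start()
    ssc.awaitTermination()   // block until the streaming computation is stopped or fails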

Re: Compiling NetworkWordCount scala code

2014-02-07 Thread Tathagata Das
Somehow the Scala compiler is not able to infer the types from _ + _. Try reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(30), Seconds(10)) TD 2014-02-06 Eduardo Costa Alfaia e.costaalf...@unibs.it: Hi Guys, I am getting this error when I compile NetworkWordCount.scala: nfo]
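
For context, a sketch of the full call with explicit parameter types (wordCounts is an assumed DStream[(String, Int)]):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._

    val windowed = wordCounts.reduceByKeyAndWindow(
      (x: Int, y: Int) => x + y,   // explicit types, so nothing has to be inferred from _ + _
      Seconds(30),                 // window length
      Seconds(10))                 // slide interval
    windowed.print()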

Re: Database connection per worker

2014-02-06 Thread Tathagata Das
Using local[4] runs everything in local mode within a single JVM. So you are expected to get only one connection when using a static variable. TD On Thu, Feb 6, 2014 at 5:01 AM, aecc alessandroa...@gmail.com wrote: I forgot to mention that I'm running Spark Streaming -- View this message

Re: Errors occurred while compiling module 'spark-streaming-zeromq' (IntelliJ IDEA 13.0.2)

2014-02-06 Thread Tathagata Das
Does an sbt/sbt clean help? If it doesn't and the problem occurs repeatedly, can you tell us the sequence of commands you are using (from a clean GitHub clone) so that we can reproduce the problem? TD On Thu, Feb 6, 2014 at 6:04 AM, zgalic zdravko.ga...@fer.hr wrote: Hi Spark users,

Re: Message processing rate of spark

2014-02-06 Thread Tathagata Das
= + (count / *batchInterval*) + records / second) ? On Wed, Feb 5, 2014 at 2:05 PM, Sourav Chandra sourav.chan...@livestream.com wrote: Hi Tathagata, How can i find the batch size? Thanks, Sourav On Wed, Feb 5, 2014 at 2:02 PM, Tathagata Das tathagata.das1...@gmail.com wrote

Re: Spark Streaming StreamingContext error

2014-02-05 Thread Tathagata Das
Seems like it is not able to find a particular class - org.apache.spark.metrics.sink.MetricsServlet . How are you running your program? Is this an intermittent error? Does it go away if you do a clean compilation of your project and run again? TD On Tue, Feb 4, 2014 at 9:22 AM, soojin

Re: Message processing rate of spark

2014-02-05 Thread Tathagata Das
Hi Sourav, For the number of records received per second, you could use something like this to calculate the number of records in each batch and divide it by your batch size: yourKafkaStream.foreachRDD(rdd => { val count = rdd.count println("Current rate = " + (count / batchSize) + " records / second")

Re: spark streaming questions

2014-02-05 Thread Tathagata Das
Responses inline. On Mon, Feb 3, 2014 at 11:03 AM, Liam Stewart liam.stew...@gmail.comwrote: I'm looking at adding spark / shark to our analytics pipeline and would also like to use spark streaming for some incremental computations, but I have some questions about the suitability of spark

Re: Spark + MongoDB

2014-02-04 Thread Tathagata Das
.) Thanks for any help. Best regards, Sampo N. On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I walked through the example in the second link you gave. The Treasury Yield example referred to there is here: https://github.com/mongodb/mongo-hadoop/blob/master

Re: Spark + MongoDB

2014-01-30 Thread Tathagata Das
I walked through the example in the second link you gave. The Treasury Yield example referred to there is here: https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java. Note the InputFormat and

Re: SLF4J error and log4j error

2014-01-30 Thread Tathagata Das
Error 1) I don't think one should be worried about this error. This just means that it has found two instances of the SLF4J library, one from each JAR. And both instances are probably the same version of the SLF4J library (both come from Spark). So this is not really an error. It's just an annoying

Re: Distributed Shared Access of Cached RDD's

2014-01-30 Thread Tathagata Das
That depends. By default, the tasks are launched with a location preference. So if there is no free slot currently available on Node 1, Spark will wait for a free slot. However, if you enable the delay scheduler (see the config property spark.locality.wait), then it may launch tasks on other machines with free

Re: Lazy Scheduling

2014-01-30 Thread Tathagata Das
You can enable fair sharing of resources between jobs in Spark. See http://spark.incubator.apache.org/docs/latest/job-scheduling.html On Sun, Jan 26, 2014 at 8:25 PM, Sai Prasanna ansaiprasa...@gmail.comwrote: Please someone throw some light into this. In lazy scheduling that spark had
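
For example, a sketch of switching the scheduling mode to fair sharing via SparkConf (it can equally be set in a properties file):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("FairSchedulingExample")
      .set("spark.scheduler.mode", "FAIR")   // jobs within this application share resources fairly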

Re: How to create RDD over hashmap?

2014-01-24 Thread Tathagata Das
On this note, you can do something smarter than the basic lookup function. You could convert each partition of the key-value pair RDD into a hashmap, using something like val rddOfHashmaps = pairRDD.mapPartitions(iterator => { val hashmap = new HashMap[String, ArrayBuffer[Double]]
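
A possible completion of that sketch (the value type follows the snippet above; the lookup at the end is an assumption about how the per-partition hashmaps would then be used):

    import scala.collection.mutable.{ArrayBuffer, HashMap}

    // Assumes pairRDD is an RDD[(String, Double)].
    val rddOfHashmaps = pairRDD.mapPartitions { iterator =>
      val hashmap = new HashMap[String, ArrayBuffer[Double]]
      iterator.foreach { case (key, value) =>
        hashmap.getOrElseUpdate(key, new ArrayBuffer[Double]) += value
      }
      Iterator(hashmap)          // one hashmap per partition
    }
    rddOfHashmaps.cache()        // keep the hashmaps around for repeated lookups
    // A lookup then probes one hashmap per partition instead of scanning every record:
    val hits = rddOfHashmaps.flatMap(_.get("someKey")).collect()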

Re: Division of work between master, worker, executor and driver

2014-01-24 Thread Tathagata Das
Master and Worker are components of Spark's standalone cluster manager, which manages the available resources in a cluster and divides them between different Spark applications. A Spark application's driver asks the Master for resources. The Master allocates certain Workers to the application.

Re: OOM - Help Optimizing Local Job

2014-01-22 Thread Tathagata Das
is not accessible outside the Partitions, since when I save to file it is empty. Could you please clarify how to use mapPartitions()? Thanks, Brad On Mon, Jan 20, 2014 at 6:41 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Well, right now it is quite parallelized as each line

Re: Print in JavaNetworkWordCount

2014-01-20 Thread Tathagata Das
Hi Eduardo, You can do arbitrary stuff with the data in a DStream using the operation foreachRDD. yourDStream.foreachRDD(rdd => { // Get and print first n elements val firstN = rdd.take(n) println("First N elements = " + firstN) // Count the number of elements in each batch
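
A possible completed form of that snippet (n = 10 is an arbitrary choice; yourDStream is assumed to exist):

    val n = 10
    yourDStream.foreachRDD { rdd =>
      // Get and print the first n elements of this batch
      val firstN = rdd.take(n)
      println("First N elements = " + firstN.mkString(", "))
      // Count the number of elements in each batch
      println("Batch size = " + rdd.count())
    }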

Re: Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

2014-01-20 Thread Tathagata Das
Hi Hussam, Have you (1) generated the Spark jar using sbt/sbt assembly, and (2) distributed the Spark jar to the worker machines? It could be that the system expects the Spark jar to be present in /opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.

Re: Do RDD actions run only on driver ?

2014-01-19 Thread Tathagata Das
worker node(s) ? On Sat, Jan 18, 2014 at 10:56 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes, RDD actions can be called only in the driver program, therefore only in the driver node. However, they can be parallelized within the driver program by calling multiple actions from multiple

Re: OOM - Help Optimizing Local Job

2014-01-19 Thread Tathagata Das
think it would. I am running a MB Pro Retina i7 + 16gb Ram). I am worried that the using ListBuffer or HashMap will significantly slow things down and there are better ways to do this. Thanks, Brad On Sat, Jan 18, 2014 at 7:11 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Hello

Re: OOM - Help Optimizing Local Job

2014-01-18 Thread Tathagata Das
Hello Brad, The shuffle operation seems to be taking too much memory, more than what your Java program can provide. I am not sure whether you have already tried these or not, but there are a few basic things you can try. 1. If you are running a local standalone Spark cluster, you can set the amount of

Re: Which of the hadoop file formats are supported by Spark ?

2014-01-18 Thread Tathagata Das
Spark was built using the standard Hadoop libraries of InputFormat and OutputFormat, so any InputFormat and OutputFormat should ideally be supported. Besides the simplified interfaces for text files (sparkContext.textFile(...)) and sequence files (sparkContext.sequenceFile(...)), you can specify your
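
A sketch of specifying an arbitrary Hadoop InputFormat explicitly (TextInputFormat stands in for any InputFormat; sc and the path are assumptions):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
    val lines = records.map { case (_, text) => text.toString }
    println("Read " + lines.count() + " records")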

Re: jarOfClass method no found in SparkContext

2014-01-15 Thread Tathagata Das
Could it be possible that you have an older version of JavaSparkContext (i.e. from an older version of Spark) in your path? Please check that there aren't two versions of Spark accidentally included in your class path used in Eclipse. It would not give errors in the import (as it finds the

Re: Reading files on a cluster / shared file system

2014-01-15 Thread Tathagata Das
If you are running a distributed Spark cluster over the nodes, then the reading should be done in a distributed manner. If you give sc.textFile() a local path to a directory in the shared file system, then each worker should read a subset of the files in the directory by accessing them locally.
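
A minimal sketch of that shared-file-system case (sc is an assumed SparkContext; the path is a hypothetical directory mounted at the same location on every worker, e.g. over NFS):

    val logs = sc.textFile("file:///shared/data/logs/*")
    println("Partitions: " + logs.partitions.length)   // the files are split across the workers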

Re: Spark streaming on YARN?

2014-01-09 Thread Tathagata Das
If you have been able to get Spark Pi to run on YARN, then you should be able to run the streaming example HdfsWordCount (https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/HdfsWordCount.scala) as well. Even though the instructions in the

Re: Stateful RDD

2013-12-27 Thread Tathagata Das
Just to add to Christopher's suggestion, do make sure that ScriptEngine.eval is thread-safe. If it is not, you can use ThreadLocal (http://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html) to make sure there is one instance per execution thread. TD On Fri, Dec 27, 2013 at 8:12 PM,
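
A minimal sketch of the ThreadLocal pattern, assuming a JavaScript ScriptEngine is evaluated inside Spark tasks (rdd is an assumed RDD[String] of expressions):

    import javax.script.{ScriptEngine, ScriptEngineManager}

    // One ScriptEngine per execution thread, created lazily on first use.
    object Engines {
      private val local = new ThreadLocal[ScriptEngine] {
        override def initialValue(): ScriptEngine =
          new ScriptEngineManager().getEngineByName("JavaScript")
      }
      def get: ScriptEngine = local.get()
    }

    val results = rdd.map(expr => String.valueOf(Engines.get.eval(expr)))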