I don't think there is a clean way to do that. It's best to create separate
RDDs yourself.
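For example, a rough sketch of building one RDD per file, so the file name is known for each RDD (the path listing here is made up; you would generate it from your own S3 listing):
val paths = Seq("s3n://some/path/a/part-0", "s3n://some/path/b/part-0")   // hypothetical file list
val rddsByFile = paths.map(path => (path, sc.textFile(path)))
rddsByFile.foreach { case (path, rdd) =>
  println(path + " has " + rdd.count() + " lines")   // the file name is known for each RDD
}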
TD
On Sat, Feb 22, 2014 at 12:11 PM, Grega Kešpret gr...@celtra.com wrote:
Is it possible to get the location (e.g. the file name) of an RDD partition?
Let's say I do
val logs = sc.textFile("s3n://some/path/*/*")
at 4:48 PM, Fabrizio Milo aka misto
mistob...@gmail.com wrote:
I am using the latest from github compiled locally
On Sat, Feb 22, 2014 at 3:22 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Which version of Spark are you using?
TD
On Sat, Feb 22, 2014 at 3:15 PM, Fabrizio
1. I don't think we have tested window sizes that long.
2. If you have to keep track of a day's worth of data, it may be better to
use an external system that is more dedicated to lookups over massive
amounts of data (say, Cassandra). Use some unique key to push all the data
to Cassandra and
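As a rough sketch of that pattern (writeToCassandra here is a hypothetical helper standing in for whatever Cassandra client you use):
yourDStream.foreachRDD(rdd => {
  rdd.foreachPartition(records => {
    // write each (key, value) record of this partition via the hypothetical helper
    records.foreach { case (key, value) => writeToCassandra(key, value) }
  })
})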
1. Yes, Spark Streaming can process many data streams in parallel, especially
if all the data streams are of the same type and all the streams get merged
and processed together through the same operations. You should be able to
use
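For instance, a minimal sketch of merging several input streams of the same type (the hosts and port are illustrative):
val streams = (1 to 4).map(i => ssc.socketTextStream("host" + i, 9999))
val unified = ssc.union(streams)          // one DStream, processed by the same operations
unified.count().print()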
As Jeremy said, Spark Streaming has no Python API yet. However, there
are a number of things you can do that allow you to do your main data
manipulation in Python. The Spark API allows the data of a dataset to be
piped out to any arbitrary external script (say, a Bash script or a
Python script).
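A rough sketch of that approach (the script path is illustrative):
yourDStream.foreachRDD(rdd => {
  // each element is written to the external script's stdin; its stdout lines come back as a new RDD
  val processed = rdd.pipe("/path/to/process.py")
  processed.take(10).foreach(println)
})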
The reason we chose to define windows based on time is because of the
underlying system design of Spark Streaming. Spark Streaming essentially
divides received data into batches of a fixed time interval and then runs a
Spark job on each batch. So the system naturally maintains a mapping of time
interval to
To add to the discussion, Spark Streaming's text file stream automatically
detects new files and generates RDDs out of them. For example, if you run 10
second batches, then all new files (of the same format) generated in the
directory in every interval will be read and made into per-interval RDDs.
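For example (the directory and batch interval are illustrative):
val ssc = new StreamingContext(sparkConf, Seconds(10))
// every 10-second batch picks up the files newly created in this directory
val lines = ssc.textFileStream("hdfs://namenode:8020/incoming/")
lines.count().print()
ssc.start()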
This is a little confusing. Let's try to confirm the following first. In the
Spark application's web UI, can you find the stage (one of the first
few) that has only 1 task and has the name XYZ at NetworkInputTracker? In
that stage, can you see where the single task is running? Is it on node-005, or
It could be that the worker receiving the data was undergoing GC and so
could not actually receive any data. Can you check the web UI for the
application to see the GC times of the corresponding stages?
TD
On Thu, Feb 20, 2014 at 12:03 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
is fresh data
, 2014 at 3:45 PM, Adrian Mocanu amoc...@verticalscope.com
wrote:
I have a question on the following paper
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
written by
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker,
Ion Stoica
and available
in this case because there is no replication? Won't Spark, in this case,
use the power of its RDDs?
Thanks again
A
*From:* Tathagata Das [mailto:tathagata.das1...@gmail.com]
*Sent:* February-14-14 8:15 PM
*To:* user@spark.incubator.apache.org
*Subject:* Re: checkpoint and not running out
You have to import StreamingContext._
That imports implicit conversions that allow reduceByKey() to be applied on
DStreams with key-value pairs.
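For example (words here stands in for any DStream of strings):
import org.apache.spark.streaming.StreamingContext._   // brings the pair-DStream implicits into scope

val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)   // resolves once the implicits are imported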
TD
On Tue, Feb 18, 2014 at 12:03 PM, bethesda swearinge...@mac.com wrote:
I am getting this error when trying to run the code from the following page in
The default ZeroMQ receiver that comes with the Spark repository does not
guarantee which machine the ZeroMQ receiver will be launched on; it can be on
any of the worker machines in the cluster, NOT the application machine
(called the driver in our terms). And your understanding of the code and
the
It could be that the hostname that Spark uses to identify the node is
different from the one you are providing. Are you using the Spark
standalone mode? In that case, you can check the hostnames that Spark
is seeing and use those names.
Let me know if that works out.
TD
On Mon, Feb 17, 2014
checkpointing? how?
On Tue, Feb 18, 2014 at 5:44 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
A3: The basic RDD model is that the dataset is immutable. As new batches
of data come in, each batch is treated as an RDD. Then RDD transformations are
applied to create new RDDs. When some
duration, but is it possible to set this to, let's say,
an hour without changing the slide duration?
Thanks
Pankaj
On Fri, Feb 14, 2014 at 7:50 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Answers inline. Hope these answer your questions.
TD
On Thu, Feb 13, 2014 at 5:49 PM, Sourav
Hello Adrian,
A1: There is a significant difference between cache and checkpoint. Cache
materializes the RDD and keeps it in memory and/or on disk. But the lineage
of the RDD (that is, the sequence of operations that generated the RDD) will be
remembered, so that if there are node failures and parts of the
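Roughly, the two look like this in code (the checkpoint directory is illustrative):
rdd.cache()             // keep the materialized data in memory/disk; the lineage is retained
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")
rdd.checkpoint()        // save the data to reliable storage and truncate the lineage
rdd.count()             // an action triggers both the caching and the checkpointing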
regards,
- Guanhua
From: Tathagata Das tathagata.das1...@gmail.com
Reply-To: user@spark.incubator.apache.org
Date: Thu, 13 Feb 2014 17:39:53 -0800
To: user@spark.incubator.apache.org
Subject: Re: Cluster launch
I am not entirely sure if that was the intended configuration for the
scripts
If you are using an sbt project file to link to Spark Streaming, then it is
actually simpler. Here is an example sbt file that links to Spark Streaming
and Spark Streaming's Twitter functionality (for Spark 0.9).
https://github.com/amplab/training/blob/ampcamp4/streaming/scala/build.sbt
Instead of
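A minimal sketch of such a build file (the version strings are assumptions for Spark 0.9):
name := "my-streaming-app"
scalaVersion := "2.10.3"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "0.9.0-incubating",
  "org.apache.spark" %% "spark-streaming-twitter" % "0.9.0-incubating"
)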
You could do a couple of things.
1. You can explicitly call streamingContext.stop() when the first iteration
is over. To detect whether the first iteration is over, you can use the
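One rough way to do that (the flag-based detection is only an illustration):
import java.util.concurrent.atomic.AtomicBoolean

val firstBatchDone = new AtomicBoolean(false)
yourDStream.foreachRDD(rdd => {
  rdd.count()                       // force the batch to be processed
  firstBatchDone.set(true)          // this function runs on the driver after each batch
})
ssc.start()
while (!firstBatchDone.get()) Thread.sleep(100)
ssc.stop()                          // stop the streaming context after the first batch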
It should work when the property is set BEFORE creating the
StreamingContext. Or if you are explicitly creating a SparkContext and then
creating a StreamingContext with that SparkContext, then the configuration
must be set BEFORE the SparkContext is created. With 0.9, you can also use
the SparkConf
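For example, with the 0.9 SparkConf API (the property shown is just an illustration):
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("MyStreamingApp")
  .set("spark.cleaner.ttl", "3600")            // set BEFORE the contexts are created
val ssc = new StreamingContext(conf, Seconds(1))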
Launching your application in a cluster may be useful in a number of
scenarios.
1) In a number of settings in companies, users who want to run jobs do not
have ssh access to any of the cluster nodes. So they have to run the Spark
driver program on their local machine and connect to the Spark
You could use sbin/start-slave.sh on the slave machine to launch the slave.
That should use the local SPARK_HOME on the slave machine to launch the
worker correctly.
TD
On Thu, Feb 13, 2014 at 1:09 PM, Guanhua Yan gh...@lanl.gov wrote:
Hi all:
I was trying to run sbin/start-master.sh and
Answers inline. Hope these answer your questions.
TD
On Thu, Feb 13, 2014 at 5:49 PM, Sourav Chandra
sourav.chan...@livestream.com wrote:
Hi,
I have a couple of questions:
1. While going through the spark-streaming code, I found out there is one
configuration in JobScheduler/Generator
awaitTermination() was added in Spark 0.9. Are you trying to run the
HdfsWordCount example, maybe in your own separate project? Make sure you
are compiling with Spark 0.9 and not anything older.
TD
On Mon, Feb 10, 2014 at 6:50 AM, Kal El pinu.datri...@yahoo.com wrote:
I am trying to run a
Somehow the Scala compiler is not able to infer the types from _ + _
Try
reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(30), Seconds(10))
TD
2014-02-06 Eduardo Costa Alfaia e.costaalf...@unibs.it:
Hi Guys,
I am getting this error when I compile NetworkWordCount.scala:
[info]
Using local[4] runs everything in local mode within a single JVM. So you
are expected to get only one connection when using a static variable.
TD
On Thu, Feb 6, 2014 at 5:01 AM, aecc alessandroa...@gmail.com wrote:
I forgot to mention that I'm running Spark Streaming
Does a sbt/sbt clean help? If it doesn't, and the problem occurs
repeatedly, can you tell us the sequence of commands you are using
(from a clean GitHub clone) so that we can reproduce the problem?
TD
On Thu, Feb 6, 2014 at 6:04 AM, zgalic zdravko.ga...@fer.hr wrote:
Hi Spark users,
println("Current rate = " + (count / batchInterval) + " records / second") ?
On Wed, Feb 5, 2014 at 2:05 PM, Sourav Chandra
sourav.chan...@livestream.com wrote:
Hi Tathagata,
How can i find the batch size?
Thanks,
Sourav
On Wed, Feb 5, 2014 at 2:02 PM, Tathagata Das
tathagata.das1...@gmail.com wrote
Seems like it is not able to find a particular class:
org.apache.spark.metrics.sink.MetricsServlet.
How are you running your program? Is this an intermittent error? Does it go
away if you do a clean compilation of your project and run again?
TD
On Tue, Feb 4, 2014 at 9:22 AM, soojin
Hi Sourav,
For the number of records received per second, you could use something like
this to calculate the number of records in each batch, and divide it by your
batch size.
yourKafkaStream.foreachRDD(rdd => {
  val count = rdd.count()
  println("Current rate = " + (count / batchSize) + " records / second")
})
Responses inline.
On Mon, Feb 3, 2014 at 11:03 AM, Liam Stewart liam.stew...@gmail.com wrote:
I'm looking at adding spark / shark to our analytics pipeline and would
also like to use spark streaming for some incremental computations, but I
have some questions about the suitability of spark
.)
Thanks for any help.
Best regards,
Sampo N.
On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I walked through the example in the second link you gave. The Treasury
Yield example referred to there is here:
https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
Note the InputFormat and
Error 1) I don't think one should be worried about this error. This just
means that it has found two instances of the SLF4J library, one from each
JAR. And both instances are probably the same version of the SLF4J library
(both come from Spark). So this is not really an error. It's just an
annoying
That depends. By default, tasks are launched with a location preference.
So if there is no free slot currently available on Node 1, Spark will wait
for a free slot. However, if you enable delay scheduling (see the config property
spark.locality.wait), then it may launch tasks on other machines with free
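For example, a sketch of setting that property (the value, in milliseconds, is illustrative):
val conf = new SparkConf().set("spark.locality.wait", "500")   // how long to wait before relaxing locality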
You can enable fair sharing of resources between jobs in Spark. See
http://spark.incubator.apache.org/docs/latest/job-scheduling.html
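A rough sketch of enabling it (the pool name is illustrative):
val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
// optionally assign the jobs submitted from this thread to a named pool
sc.setLocalProperty("spark.scheduler.pool", "streaming-pool")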
On Sun, Jan 26, 2014 at 8:25 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Please someone throw some light into this.
In lazy scheduling that spark had
On this note, you can do something smarter than the basic lookup function.
You could convert each partition of the key-value pair RDD into a hashmap
using something like
val rddOfHashmaps = pairRDD.mapPartitions(iterator => {
  val hashmap = new HashMap[String, ArrayBuffer[Double]]
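Completing that sketch (assuming pairRDD is an RDD of (String, Double) pairs):
import scala.collection.mutable.{ArrayBuffer, HashMap}

val rddOfHashmaps = pairRDD.mapPartitions(iterator => {
  val hashmap = new HashMap[String, ArrayBuffer[Double]]
  iterator.foreach { case (key, value) =>
    hashmap.getOrElseUpdate(key, new ArrayBuffer[Double]) += value
  }
  Iterator(hashmap)                 // one hashmap per partition, usable for fast lookups
}, preservesPartitioning = true)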
Master and Worker are components of Spark's standalone cluster manager,
which manages the available resources in a cluster and divides them between
different Spark applications.
A Spark application's driver asks the Master for resources. The Master
allocates certain Workers to the application.
is not accessible outside the Partitions, since
when I save to file it is empty. Could you please clarify how to use
mapPartitions()?
Thanks,
Brad
On Mon, Jan 20, 2014 at 6:41 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Well, right now it is quite parallelized as each line
Hi Eduardo,
You can do arbitrary stuff with the data in a DStream using the operation
foreachRDD.
yourDStream.foreachRDD(rdd => {
  // Get and print the first n elements
  val firstN = rdd.take(n)
  println("First N elements = " + firstN.mkString(", "))
  // Count the number of elements in each batch
  println("Count = " + rdd.count())
})
Hi Hussam,
Have you (1) generated the Spark jar using sbt/sbt assembly, and (2) distributed the
Spark jar to the worker machines? It could be that the system expects the
Spark jar to be present in
/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.
worker node(s) ?
On Sat, Jan 18, 2014 at 10:56 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Yes, RDD actions can be called only in the driver program, and therefore only
on the driver node. However, they can be parallelized within the driver
program by calling multiple actions from multiple
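A rough illustration using plain Scala futures (the RDD names and output path are illustrative):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// two independent actions submitted concurrently from driver-side threads
val f1 = Future { rdd1.count() }
val f2 = Future { rdd2.saveAsTextFile("hdfs://namenode:8020/out") }
Await.result(f1, 10.minutes)
Await.result(f2, 10.minutes)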
think it would. I am running a MB Pro
Retina (i7 + 16 GB RAM). I am worried that using ListBuffer or HashMap
will significantly slow things down and that there are better ways to do this.
Thanks,
Brad
On Sat, Jan 18, 2014 at 7:11 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Hello
Hello Brad,
The shuffle operation seems to be taking too much memory, more than what
the JVM can provide. I am not sure whether you have already tried
these or not, but there are a few basic things you can try.
1. If you are running a local standalone Spark cluster, you can set the
amount of
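For example (the property name assumes the pre-1.0 configuration style; the value is illustrative):
System.setProperty("spark.executor.memory", "4g")   // must be set before the SparkContext is created
val sc = new SparkContext("spark://master:7077", "MyApp")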
Spark was built using the standard Hadoop InputFormat and OutputFormat
interfaces, so any InputFormat and OutputFormat should ideally be
supported. Besides the simplified interfaces for text files
(sparkContext.textFile(...)) and sequence files (sparkContext.sequenceFile(...)),
you can specify your
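For example, a sketch using an arbitrary Hadoop InputFormat (the format and path are placeholders):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val records = sc.hadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs://namenode:8020/data", 4)                 // 4 = minimum number of partitions
records.map(_._2.toString).take(5).foreach(println)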
Could it be possible that you have an older version of JavaSparkContext
(i.e. from an older version of Spark) in your path? Please check that there
aren't two versions of Spark accidentally included in your class path used
in Eclipse. It would not give errors in the import (as it finds the
If you are running a distributed Spark cluster over the nodes, then the
reading should be done in a distributed manner. If you give sc.textFile() a
local path to a directory in the shared file system, then each worker
should read a subset of the files in the directory by accessing them locally.
If you have been able to run SparkPi on YARN, then you should be
able to run the streaming example HdfsWordCount
(https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/HdfsWordCount.scala)
as well. Even though the instructions in the
Just to add to Christopher's suggestion, do make sure that the
ScriptEngine.eval is thread-safe. If it is not, you can use ThreadLocal
(http://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html) to
make sure there is one instance per execution thread.
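A rough sketch of that (the engine name is illustrative):
import javax.script.{ScriptEngine, ScriptEngineManager}

// one engine per execution thread, created lazily on first use
val localEngine = new ThreadLocal[ScriptEngine] {
  override def initialValue(): ScriptEngine =
    new ScriptEngineManager().getEngineByName("JavaScript")
}
rdd.map { record =>
  localEngine.get().eval(record.toString)   // each task thread uses its own engine instance
}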
TD
On Fri, Dec 27, 2013 at 8:12 PM,