Hi,
I have the following use case.
(1) I have an RDD of edges of a graph (say R).
(2) do a groupBy on R (by say source vertex) and call a function F on each
group.
(3) collect the results from Fs and do some computation
(4) repeat the above steps until some criteria is met
In (2), the groups
val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER)
// or whatever persistence makes more sense for you ...
var done = false
while (!done) {
  val res = grouped.flatMap(F)
  res.collect.foreach(func)
  done = criteria
}
On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan
I have this function in the driver program which collects the results from the
RDDs (in a stream) into an array and returns it. However, even though the RDDs
(in the DStream) have data, the function is returning an empty array... What
am I doing wrong?
I can print the RDD values inside the foreachRDD call
Hi,
On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore
thanigai.vell...@gmail.com wrote:
It appears that the function immediately returns even before the
foreachrdd stage is executed. Is that possible?
Sure, that's exactly what happens. foreachRDD() only schedules a computation; it
does not execute it immediately, so your function returns before any batch has run.
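For anyone hitting the same thing, here is a minimal sketch of the pattern (not the original poster's code; it assumes a DStream[String] called dstream and a StreamingContext called ssc): do the collect and whatever follows inside foreachRDD, which runs once per batch after the context is started, rather than returning an array from a driver-side helper.

var latestBatch: Array[String] = Array.empty

dstream.foreachRDD { rdd =>
  val collected = rdd.collect()   // runs once per batch, after ssc.start()
  latestBatch = collected         // use the data here: write it out, update state, etc.
  collected.foreach(println)
}

ssc.start()
ssc.awaitTermination()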
thanigai.vell...@gmail.com wrote:
I have this function in the driver program which collects the results from the
RDDs (in a stream) into an array and returns it. However, even though the RDDs
(in the DStream) have data, the function is returning an empty array... What
am I doing wrong?
I can print the RDD values inside the foreachRDD call.
So, in case someone else notices a behavior like this, make sure you check
your cluster monitor (like ganglia).
On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
I am storing RDDs with the MEMORY_ONLY_SER Storage Level, during the run
of a big job
Hi,
I'm writing a Spark program where I want to divide an RDD into different
groups, but the groups are too big to use groupByKey. To cope with that,
since I know in advance the list of keys for each group, I build a map from
the keys to the RDDs that result from filtering the input RDD to get
Hi all,
I am trying to create RDDs from within rdd.foreachPartition() so I can
save these RDDs to ElasticSearch on the fly:
stream.foreachRDD(rdd => {
  rdd.foreachPartition {
    iterator => {
      val sc = rdd.context
      iterator.foreach {
        case (cid
Hi Sean,
Thanks a lot for your answer. That explains it, as I was creating thousands
of RDDs, so I guess the communication overhead was the reason why the Spark
job was freezing. After changing the code to use RDDs of pairs and
aggregateByKey it works just fine, and quite fast.
Again, thanks
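For reference, a minimal sketch of the aggregateByKey approach described above (the RDD name and value type are illustrative, not from the original code): it computes count, sum, min and max per key in one pass, instead of building one RDD per key and calling stats() on each.

// pairs is assumed to be an RDD[(String, Double)]
val zero = (0L, 0.0, Double.MaxValue, Double.MinValue)   // (count, sum, min, max)

val statsPerKey = pairs.aggregateByKey(zero)(
  (acc, v) => (acc._1 + 1, acc._2 + v, math.min(acc._3, v), math.max(acc._4, v)),
  (a, b) => (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))
)

val summary = statsPerKey.mapValues { case (n, sum, mn, mx) => (n, sum / n, mn, mx) }  // count, mean, min, max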
At some level, enough RDDs create hundreds of thousands of tiny
partitions of data, each of which creates a task for each stage. The
raw overhead of all the message passing can slow things down a lot. I
would not design something to use an RDD per key. You would generally
key by some value you
You can't use RDDs inside RDDs. RDDs are managed from the driver, while
functions like foreachPartition execute things on the remote executors.
You can write code to simply save whatever you want to ES directly.
There is not necessarily a need to use RDDs for that.
On Wed, Feb 18, 2015 at 11:36 AM, t1ny
Hi Paweł,
Thanks a lot for your answer. I finally got the program to work by using
aggregateByKey, but I was wondering why creating thousands of RDDs doesn't
work. I think that could be interesting for using methods that work on RDDs
like for example JavaDoubleRDD.stats() (
http
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I'm writing a Spark program where I want to divide an RDD into different
groups, but the groups are too big to use groupByKey. To cope with that,
since I know in advance the list of keys for each group, I build a map from
the keys to the RDDs that result from
by
union-ing the two transformed rdds, which is very different from the way it
works under the hood in scala to enable narrow dependencies. It really
needs a bigger change to pyspark. I filed this issue:
https://issues.apache.org/jira/browse/SPARK-5785
(and the somewhat related issue about
are hidden deeper inside and are not user-visible. But I think a better way
(in Scala, anyway) is to look at rdd.dependencies. It's a little tricky,
though; you need to look deep into the lineage (example at the end).
Sean -- yes, it does require both RDDs to have the same partitioner, but that
should happen naturally if you just specify the same number of partitions;
you'll get equal HashPartitioners.
Hi all,
I'd like to build/use column-oriented RDDs in some of my Spark code. A
normal Spark RDD is stored as row-oriented objects if I understand
correctly.
I'd like to leverage some of the advantages of a columnar memory format.
Shark (used to) and Spark SQL use a columnar storage format using
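For a concrete starting point, here is a small sketch of the Spark SQL route mentioned above (Spark 1.x APIs; the case class, data and table name are made up): caching a registered table stores it in Spark SQL's in-memory columnar format rather than as row objects.

import org.apache.spark.sql.SQLContext

case class Record(id: Long, value: Double)   // illustrative schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val records = sc.parallelize(Seq(Record(1L, 1.0), Record(2L, 2.5)))
records.registerTempTable("records")
sqlContext.cacheTable("records")             // columnar in-memory caching
val avg = sqlContext.sql("SELECT AVG(value) FROM records").collect()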
yeah I thought the same thing at first too, I suggested something
equivalent w/ preservesPartitioning = true, but that isn't enough. the
join is done by union-ing the two transformed rdds, which is very different
from the way it works under the hood in scala to enable narrow
dependencies
Hi,
I believe that partitionBy will use the same (default) partitioner on
both RDDs.
On 2015-02-12 17:12, Sean Owen wrote:
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com
wrote:
Hi Karlson,
I think your
Hi All,
using Pyspark, I create two RDDs (one with about 2M records (~200MB),
the other with about 8M records (~2GB)) of the format (key, value).
I've done a partitionBy(num_partitions) on both RDDs and verified that
both RDDs have the same number of partitions and that equal keys reside
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by lazy
evaluation of RDDs. After you do your partitionBy, are you sure those RDDs
are actually materialized and cached somewhere? E.g., if you just did
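To make that concrete, a minimal sketch (rddA/rddB and the partition count are placeholders): co-partition both pair RDDs with the same HashPartitioner, materialize them, then join, and inspect the lineage as suggested.

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(100)
val a = rddA.partitionBy(part).persist()
val b = rddB.partitionBy(part).persist()
a.count()
b.count()                        // force materialization so the join reuses the cached, co-partitioned data

val joined = a.join(b)           // same partitioner on both sides, so no extra shuffle for the join

println(joined.toDebugString)    // one way to look at the lineage
joined.dependencies.foreach(d => println(d.getClass.getSimpleName))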
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote:
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by lazy
On 10.01.2015 06:56, Xiangrui Meng wrote:
sample 2 * n tuples, split them into two parts, balance the sizes of
these parts by filtering some tuples out
How do you guarantee that the two RDDs have the same size?
-Xiangrui
On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke
1wil
Hey Spark gurus! Sorry for the confusing title. I do not know the exact
description of my problem; if you know, please tell me so I can change it :-)
Say I have two RDDs right now, and they are
val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6
val rdd2 = sc.parallelize(List((2,(1)), (2
The solution we have is to format the data going into the pipe so that it
includes the ids, and then restore it all in a map after the pipe.
However, it would be much nicer if we could just join/zip back the output of
the pipe. But we can't cache the RDDs, so it would be nice to have a
forkRDD of some
Hi,
I am trying to find where Spark persists RDDs when we call the persist()
API and execute under YARN. This is purely for understanding...
In my driver program, I wait indefinitely, so as to avoid any clean-up
problems.
In the actual job, I roughly do the following:
JavaRDD<String> lines = context.textFile(args[0]);
lines.persist
Hi Spark community,
I have a problem with zipping two RDDs of the same size and same number
of partitions.
The error message says that zipping is only allowed on RDDs which are
partitioned into chunks of exactly the same sizes.
How can I ensure this? My workaround at the moment is to repartition
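One common workaround, sketched below on the assumption that the two RDDs only need to have the same total number of elements (rddA/rddB are placeholders): pair each element with its index and join on the index, which, unlike zip, does not require identical per-partition sizes.

val indexedA = rddA.zipWithIndex().map { case (v, i) => (i, v) }
val indexedB = rddB.zipWithIndex().map { case (v, i) => (i, v) }
val zipped   = indexedA.join(indexedB).values   // (a, b) pairs matched by position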
Is there a way to join non-DStream RDDs with DStream RDDs?
Here is the use case. I have a lookup table stored in HDFS that I want to
read as an RDD. Then I want to join it with the RDDs that are coming in
through the DStream. How can I do this?
Thanks.
Asim
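One way to do this, sketched below (the path, input format and names are assumptions, not from the original post): read the lookup table once into a pair RDD, then use DStream.transform to join each micro-batch RDD against it.

// lookup: RDD[(String, String)] built once from HDFS
val lookup = sc.textFile("hdfs:///path/to/lookup").map { line =>
  val fields = line.split(",")
  (fields(0), fields(1))
}.cache()

// dstream is assumed to already carry (key, value) pairs
val joined = dstream.transform(rdd => rdd.join(lookup))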
You can also use the join function of RDD. This is actually a kind of append
function that adds up all the RDDs and creates one uber RDD.
On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote:
Thank you for the response, sure will try that out.
Currently I changed my code such that the first
Thank you for the response, sure, I will try that out.
Currently I changed my code such that the first map (files.map) became
files.flatMap, which I guess will do something similar to what you are saying. It gives
me a List[] of elements (in this case LabeledPoints; I could also do RDDs)
which I then turned
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
that we've been developing that connects various different RDDs together
based on some predefined business cases. After updating to 1.2.0, some of
the concurrency expectations about how the stages within jobs are executed
I asked this question too soon. I am caching off a bunch of RDDs in a
TrieMap so that our framework can wire them together, and the locking was
not completely correct; therefore it was creating multiple new RDDs at
times instead of using cached versions, which were creating completely
separate
Hi Manoj,
I've noticed that the storage tab only shows RDDs that have been cached.
Did you call .cache() or .persist() on any of the RDDs?
Andrew
On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
I create a bunch of RDDs, including schema RDDs. When I run
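As a quick illustration of the point above (the input path is made up): an RDD shows up in the Storage tab only after it is both marked persistent and materialized by an action.

import org.apache.spark.storage.StorageLevel

val cached = sc.textFile("hdfs:///some/input").map(_.length)
cached.persist(StorageLevel.MEMORY_ONLY)   // or simply cached.cache()
cached.count()                             // run an action so the blocks are actually stored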
An RDD cannot contain elements of type RDD (i.e. you can't nest RDDs within
RDDs; in fact, I don't think it makes any sense).
I suggest, rather than having an RDD of file names, collecting those file name
strings back on to the driver as a Scala array of file names, and then from
there, make an array
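A minimal sketch of that suggestion (fileNameRDD is a hypothetical RDD[String] of paths): collect the names to the driver, build one RDD per file there, and combine them with SparkContext.union.

val fileNames: Array[String] = fileNameRDD.collect()
val perFileRdds = fileNames.map(path => sc.textFile(path))
val combined = sc.union(perFileRdds)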
Hi,
I create a bunch of RDDs, including schema RDDs. When I run the program and
go to the UI on xxx:4040, the storage tab does not show any RDDs.
Spark version is 1.1.1 (Hadoop 2.3)
Any thoughts?
Thanks,
guarantees about ordering. In practice, you may find that you still encounter the
elements in the same order after coalesce(1), although I am not sure
that is even true.
union() is the same story; unless the RDDs are sorted I don't think
there are guarantees. However, I'm almost certain that in practice, as
it happens now, A's elements would come before B's after a union, if
you did traverse them.
On Fri, Dec 19, 2014 at 5:41 AM, madhu phatak phatak@gmail.com wrote:
Hi
Is this something I just have to live with when using the REPL? Or is this
covered by something bigger that's being addressed?
Thanks in advance
Hi Spark users,
I wonder if val resultRDD = RDDA.union(RDDB) will always have records in
RDDA before records in RDDB.
Also, will resultRDD.coalesce(1) change this ordering?
Best Regards,
Jerry
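If the ordering matters, one way to make it explicit instead of relying on union's current behavior is sketched below (RDDA/RDDB stand for the two inputs): tag each element with its source and position, then sort.

val taggedA = RDDA.zipWithIndex().map { case (v, i) => ((0, i), v) }
val taggedB = RDDB.zipWithIndex().map { case (v, i) => ((1, i), v) }
val ordered = taggedA.union(taggedB).sortByKey().values   // all of A first, then all of B, each in original order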
RDD.persist() can be useful here.
On 11 December 2014 at 14:34, ankits [via Apache Spark User List]
ml-node+s1001560n20613...@n3.nabble.com wrote:
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too
fast. How can I inspect the size of an RDD in memory and get more information
Aditya, I think you have the mental model of Spark Streaming a little
off the mark. Unlike traditional streaming systems, where any kind of
state is mutable, Spark Streaming is designed on Spark's immutable RDDs.
Streaming data is received and divided into immutable blocks, which then
form immutable RDDs.
I was having similar issues with my persistent RDDs. After some digging
around, I noticed that the partitions were not balanced evenly across the
available nodes. After a repartition, the RDD was spread evenly across
all available memory. Not sure if that is something that would help your
use-case
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast.
How can I inspect the size of an RDD in memory and get more information about
why it was cleaned up? There should be more than enough memory available on
the cluster to store them, and by default, the spark.cleaner.ttl
The ContextCleaner uncaches RDDs that have gone out of scope on the driver.
So it's possible that the given RDD is no longer reachable in your
program's control flow, or else it'd be a bug in the ContextCleaner.
On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote:
I'm using spark
If all RDD elements within a partition contain pointers to a single shared
object, Spark persists as expected when the RDD is small. However, if the
RDD is more than *200 elements* then Spark reports requiring much more
memory than it actually does. This becomes a problem for large RDDs, as
Spark
I am relatively new to Spark. I am planning to use Spark Streaming for my
OLAP use case, but I would like to know how RDDs are shared between multiple
workers.
If I need to constantly compute some stats on the streaming data, presumably
shared state would have to be updated serially by different
is calling a job. I
guess application (more than one spark context) is what I'm asking about
On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote:
I've read in the documentation that RDDs can be run concurrently when
submitted in separate threads. I'm curious how the scheduler would handle
I've read in the documentation that RDDs can be run concurrently when
submitted in separate threads. I'm curious how the scheduler would handle
propagating these down to the tasks.
I have 3 RDDs:
- one RDD which loads some initial data, transforms it and caches it
- two RDDs which use the cached
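For reference, a minimal sketch of submitting actions from separate threads (here via Scala Futures; the input path and filters are made up), so that two jobs over the same cached RDD can be scheduled concurrently.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val base = sc.textFile("hdfs:///some/input").map(_.length).cache()

val f1 = Future { base.filter(_ > 10).count() }    // each action submits its own job
val f2 = Future { base.filter(_ <= 10).count() }

val longLines  = Await.result(f1, 10.minutes)
val shortLines = Await.result(f2, 10.minutes)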
Hi,
I have a graph and I want to create RDDs equal in number to the nodes in
the graph. How can I do that?
If I have 10 nodes then I want to create 10 rdds. Is that possible in
GraphX?
Like in the C language we have arrays of pointers. Do we have arrays of RDDs in
Spark?
Can we create such an array
At 2014-12-04 02:08:45 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
I have a graph and I want to create RDDs equal in number to the nodes in
the graph. How can I do that?
If I have 10 nodes then I want to create 10 rdds. Is that possible in
GraphX?
This is possible: you can collect
Regarding: Can we create such an array and then parallelize it?
Parallelizing an array of RDDs - i.e. RDD[RDD[x]] is not possible.
RDD is not serializable.
From: Deep Pradhan [mailto:pradhandeep1...@gmail.com]
Sent: 04 December 2014 15:39
To: user@spark.apache.org
Subject: Determination
This is a common use case and this is how Hadoop APIs for reading data
work; they return an Iterator[YourRecord] instead of reading every record
in at once.
On Dec 1, 2014 9:43 PM, Andy Twigg andy.tw...@gmail.com wrote:
You may be able to construct RDDs directly from an iterator - not sure
file => transform file into a bunch of records
What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf the terasort example), but if
your example just loads each file locally, then this may cause problems.
Instead, you should load each file
Could you modify your function so that it streams through the files record
by record and outputs them to HDFS, then read them all in as RDDs and take
the union? That would only use bounded memory.
On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote:
Actually, I'm working with a binary format. The api allows reading out a
single
You may be able to construct RDDs directly from an iterator - not sure
- you may have to subclass your own.
On 1 December 2014 at 18:40, Keith Simmons ke...@pulse.io wrote:
Yep, that's definitely possible. It's one of the workarounds I was
considering. I was just curious
wrote:
Hi,
I ran into a problem when doing a join operation on two RDDs. For example,
RDDa: RDD[(String, String)] and RDDb: RDD[(String, Int)]. Then, the result
RDDc: RDD[(String, (String, Int))] = RDDa.join(RDDb). But I find the results in
RDDc are incorrect compared with RDDb. What's wrong with the join?
Say I have two RDDs with the following values
x = [(1, 3), (2, 4)]
and
y = [(3, 5), (4, 7)]
and I want to have
z = [(1, 3), (2, 4), (3, 5), (4, 7)]
How can I achieve this? I know you can use outerJoin followed by map to
achieve this, but is there a more direct way to do this?
You want to use RDD.union (or SparkContext.union for many RDDs). These
don't join on a key. Union doesn't really do anything itself, so it is low
overhead. Note that the combined RDD will have all the partitions of the
original RDDs, so you may want to coalesce after the union.
val x
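A short sketch with the values from the question above:

val x = sc.parallelize(Seq((1, 3), (2, 4)))
val y = sc.parallelize(Seq((3, 5), (4, 7)))
val z = x.union(y)            // [(1,3), (2,4), (3,5), (4,7)]; keeps all partitions of x and y
val zCompact = z.coalesce(4)  // optionally reduce the partition count afterwards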
from: file2.txt, which contains (ID, name)
val y: RDD[(Long, String)]  // ID is common in both the RDDs
[(4407, Jhon),
(2064, Maria),
(7815, Casto),
(5736, Ram),
(8031, XYZ)]
and I'm expecting the result to look like this: [(ID, Name, Count)]
[(4407, Jhon, 40),
(2064, Maria, 38),
(7815, Casto, 10),
(5736, Ram, 17),
(8031, XYZ, 3)]
Any help will really be appreciated.
Harihar, your question is the opposite of what was asked. In the future,
please start a new thread for new questions.
You want to do a join in your case. The join function does an inner join,
which I think is what you want because you stated your IDs are common in
both RDDs. For other cases you
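A small sketch of that join, using the sample rows from this thread (the counts are taken from the expected output shown above):

val names  = sc.parallelize(Seq((4407L, "Jhon"), (2064L, "Maria"), (7815L, "Casto"), (5736L, "Ram"), (8031L, "XYZ")))
val counts = sc.parallelize(Seq((4407L, 40), (2064L, 38), (7815L, 10), (5736L, 17), (8031L, 3)))

val result = names.join(counts)                          // inner join on the common ID
  .map { case (id, (name, count)) => (id, name, count) }

result.collect().foreach(println)                        // e.g. (4407,Jhon,40)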
Hi, all
Suppose I have an RDD of (K, V) tuples and I do a groupBy on it with the key K.
My question is how to make each groupBy result, which is (K, Iterable[V]), an RDD.
BTW, can we transform it into a DStream where each groupBy result is also an RDD in
it?
Best Regards,
Kevin.
What's your use case? You would not generally want to make so many small
RDDs.
On Nov 20, 2014 6:19 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, all
Suppose I have an RDD of (K, V) tuples and I do a groupBy on it with the key K.
My question is how to make each groupBy result, which is (K
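For context, the usual alternative is to keep the grouped data as one RDD and process each group with mapValues, rather than turning every group into its own RDD. A minimal sketch (kv is a placeholder RDD[(String, Int)]):

val grouped = kv.groupByKey()                                  // RDD[(String, Iterable[Int])]
val perGroupStats = grouped.mapValues(vs => (vs.size, vs.sum)) // e.g. count and sum per group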
Let's say I have to apply a complex sequence of operations to a certain RDD.
In order to make code more modular/readable, I would typically have
something like this:
object myObject {
  def main(args: Array[String]) {
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    val rdd3 =
How about using the fluent style of Scala programming?
On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com
wrote:
Let's say I have to apply a complex sequence of operations to a certain
RDD.
In order to make code more modular/readable, I would typically have
something like
This code executes on the driver, and an RDD here is really just a
handle on all the distributed data out there. It's a local bookkeeping
object. So, manipulation of these objects themselves in the local
driver code has virtually no performance impact. These two versions
would be about identical*.
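To illustrate that the two styles describe the same computation, here is a sketch with made-up functions standing in for the real ones; in both versions nothing runs until an action is called.

import org.apache.spark.rdd.RDD

// Illustrative placeholders for the real transformations (myRdd is assumed to be an RDD[String]):
def function1(r: RDD[String]): RDD[(String, Int)] = r.map(w => (w, 1))
def function2(r: RDD[(String, Int)]): RDD[(String, Int)] = r.reduceByKey(_ + _)
def function3(r: RDD[(String, Int)]): RDD[(String, Int)] = r.filter(_._2 > 1)

// Named intermediate handles ...
val rdd1 = function1(myRdd)
val rdd2 = function2(rdd1)
val rdd3 = function3(rdd2)

// ... or the chained version; both are just cheap driver-side bookkeeping.
val rdd3Chained = function3(function2(function1(myRdd)))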
Let us say I have the following two RDDs, with the following key-pair
values.
rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]
and
rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]
Now, I want to join them by key values, so for example I want to return the
following
Check cogroup.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, Nov 13, 2014 at 5:11 PM, Blind Faith person.of.b...@gmail.com
wrote:
Let us say I have the following two RDDs, with the following key-pair
values.
rdd1
rdd1.union(rdd2).groupByKey()
On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith person.of.b...@gmail.com wrote:
Let us say I have the following two RDDs, with the following key-pair
values.
rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]
and
rdd2 = [ (key1, [value5
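A small sketch of both suggestions using the sample data above (assuming the values are Scala Lists):

val rdd1 = sc.parallelize(Seq(("key1", List("value1", "value2")), ("key2", List("value3", "value4"))))
val rdd2 = sc.parallelize(Seq(("key1", List("value5", "value6")), ("key2", List("value7"))))

// Option 1, as suggested above: union then group, then flatten the per-key lists
val merged1 = rdd1.union(rdd2).groupByKey().mapValues(_.flatten.toList)

// Option 2: cogroup keeps the two sides separate per key
val merged2 = rdd1.cogroup(rdd2).mapValues { case (a, b) => a.flatten.toList ++ b.flatten.toList }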
?
Regards,
Kapil
-Original Message-
From: akhandeshi [mailto:ami.khande...@gmail.com]
Sent: 12 November 2014 03:44
To: u...@spark.incubator.apache.org
Subject: Help with processing multiple RDDs
I have been struggling to process a set of RDDs. Conceptually, it is not a
large data set
I think you can try to set a lower spark.storage.memoryFraction, for example 0.4:
conf.set("spark.storage.memoryFraction", "0.4") // default is 0.6
Hi,
How is the 78g distributed among driver, daemon, and executor?
Can you please paste the logs regarding "I don't have enough memory to
hold the data in memory"?
Are you collecting any data in the driver?
Lastly, did you try doing a repartition to create smaller and evenly
distributed
Hi,
Is there a nice or optimal method to randomly shuffle spark streaming RDDs?
Thanks,
Josh
I think the answer will be the same in streaming as in the core. You
want a random permutation of an RDD? In general RDDs don't have
ordering at all -- except when you sort, for example -- so a
permutation doesn't make sense. Do you just want a well-defined but
random ordering of the data? Do
When I'm outputting the RDDs to an external source, I would like the RDDs
to be outputted in a random shuffle so that even the order is random. So
far what I understood is that the RDDs do have a type of order, in that the
order for spark streaming RDDs would be the order in which spark streaming
A use case would be helpful?
Batches of RDDs from Streams are going to have temporal ordering in terms of
when they are processed in a typical application ... , but maybe you could
shuffle the way batch iterations work
On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote:
When
is guaranteed about that.
If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)
On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote:
When I'm outputting the RDDs
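A minimal sketch of that idea (rdd is a placeholder): sort by a salted hash of each element to get a well-defined but pseudo-random permutation.

val salt = scala.util.Random.nextInt()
val shuffled = rdd.sortBy(v => (v, salt).hashCode())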
Hi Harold,
Can you include the versions of spark and spark-cassandra-connector you are
using?
Thanks!
Helena
@helenaedelson
On Oct 30, 2014, at 12:58 PM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I'd like to be able to modify values in a DStream, and then send it off to an
Hi Harold,
Yes, that is the problem :) Sorry for the confusion, I will make this clear in
the docs ;) since master is work for the next version.
All you need to do is use Spark 1.1.0 as you have it already, with
"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
and assembly - not from