I'm processing some data using PySpark and I'd like to save the RDDs to disk
(they are (k,v) RDDs of strings and SparseVector types) and read them in
using Scala to run them through some other analysis. Is this possible?
Thanks,
Rok
Hi,
Sorry, there's a typo there:
val arr = rdd.toArray
Harold
On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I'd like to be able to modify values in a DStream, and then send it off to
an external source like Cassandra, but I keep getting Serialization
Hi all,
I'd like to be able to modify values in a DStream, and then send it off to
an external source like Cassandra, but I keep getting Serialization errors
and am not sure how to use the correct design pattern. I was wondering if
you could help me.
I'd like to be able to do the following:
(count, reduce): Here there is another subtle execution difference between
- dstream.count(), which produces a DStream of single-element RDDs, the element being the count, and
- dstream.foreachRDD(_.count()), which returns the count directly.
In the first case, some random worker node is chosen for the reduce; in the other, the driver is chosen for the reduce. There should not be a significant performance difference.
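For readers skimming the thread, a minimal sketch of the two forms being contrasted (the DStream name is assumed):
val counts: DStream[Long] = dstream.count()        // a new DStream whose single-element RDDs hold each batch's count
dstream.foreachRDD { rdd =>
  val n = rdd.count()                              // runs as an action; the result comes back to the driver
  println(s"batch size: $n")
}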
Hi,
How could I combine rdds? I would like to combine two RDDs if the count in
an RDD is not above some threshold.
Thanks,
Josh
This requires evaluation of the RDD to do the count.
val x: RDD[X] = ...
val y: RDD[X] = ...
x.cache
val z = if (x.count < thres) x.union(y) else x
On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote:
Hi,
How could I combine rdds? I would like to combine two RDDs if the count in
an RDD
Thanks Matt,
Unlike the feared RDD operations on the driver, it's my understanding that
these DStream ops on the driver are merely creating an execution plan for
each RDD.
My question still remains: is it better to foreachRDD early in the process,
or do as many DStream transformations before going
PS: Just to clarify my statement:
"Unlike the feared RDD operations on the driver, it's my understanding
that these DStream ops on the driver are merely creating an execution plan
for each RDD."
With "feared RDD operations on the driver" I meant to contrast an RDD
action like rdd.collect that would
Hello,
I would like to parallelize my work on multiple RDDs I have. I wanted
to know if Spark can support a foreach on an RDD of RDDs. Here's a
Java example:
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("testapp
No, there's no such thing as an RDD of RDDs in Spark.
Here though, why not just operate on an RDD of Lists? or a List of RDDs?
Usually one of these two is the right approach whenever you feel
inclined to operate on an RDD of RDDs.
On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini tomer
Another approach could be to create artificial keys for each RDD and
convert to PairRDDs. So your first RDD becomes a
JavaPairRDD<Integer, String> rdd1 with values 1,1; 1,2 and so on.
The second RDD becomes rdd2 with 2,a; 2,b; 2,c.
You can union the two RDDs, groupByKey, countByKey etc. and maybe achieve
what
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote:
No, there's no such thing as an RDD of RDDs in Spark.
Here though, why not just operate on an RDD of Lists? or a List of RDDs?
Usually one of these two is the right approach whenever you feel
inclined to operate
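A small Scala sketch of the two shapes Sean suggests (rdd1, rdd2 and the sample data are invented for illustration):
// A List of RDDs: loop on the driver; each element is a full RDD
val rddList: List[org.apache.spark.rdd.RDD[String]] = List(rdd1, rdd2)
rddList.foreach(r => println(r.count()))
// An RDD of Lists: the nesting lives inside ordinary records
val rddOfLists = sc.parallelize(Seq(List("a", "b"), List("c")))
println(rddOfLists.map(_.size).reduce(_ + _))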
Hi Spark! I found out why my RDDs weren't coming through in my Spark
stream.
It turns out onStart() needs to return, it seems - i.e. you
need to launch the worker part of your
start process in a thread. For example:
def onStartMock(): Unit = {
val future = new Thread(new
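For reference, a rough sketch of that pattern with the Receiver API (class name, storage level and the stored data are made up; this is not the poster's actual code):
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = {
    // onStart() must return quickly; do the real work on a background thread
    new Thread("mock-receiver") {
      override def run(): Unit = {
        while (!isStopped()) store("some data")
      }
    }.start()
  }
  def onStop(): Unit = {}  // the worker thread checks isStopped(), so nothing extra is needed here
}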
def f(x):
    eat_2GB_of_ram()
    take_2h()
    return my_100MB_array
sc.parallelize(np.arange(100)).map(f).saveAsPickleFile("s3n://blah...")
The resulting RDDs will most likely not fit in memory, but for this use case
I don't really care. I know I can persist RDDs, but is there any way to
by-default disk
Oh - and one other note on this, which appears to be the case.
If, in your stream's foreachRDD implementation, you do something stupid
(like call rdd.count()):
tweetStream.foreachRDD((rdd, lent) => {
tweetStream.repartition(1)
numTweetsCollected += 1;
//val count = rdd.count()
Pinging TD -- I'm sure you know :-)
-kr, Gerard.
On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
We have been implementing several Spark Streaming jobs that are basically
processing data and inserting it into Cassandra, sorting it among different
keyspaces.
http://spark.apache.org/docs/latest/streaming-programming-guide.html
foreachRDD is executed on the driver….
mn
On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote:
Pinging TD -- I'm sure you know :-)
Hi,
We have been implementing several Spark Streaming jobs that are basically
processing data and inserting it into Cassandra, sorting it among different
keyspaces.
We've been following the pattern:
dstream.foreachRDD(rdd =>
val records = rdd.map(elem => record(elem))
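The usual fix for these serialization errors is to create the connection inside the closure, once per partition, instead of on the driver. A hedged sketch (createConnection and send stand in for whatever Cassandra client you use):
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val connection = createConnection()                 // hypothetical helper, runs on the worker
    partition.foreach(record => connection.send(record))
    connection.close()
  }
}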
Hi, my programming model requires me to generate multiple RDDs for various
datasets across a single run and then run an action on it - E.g.
MyFunc myFunc = ... //It implements VoidFunction
//set some extra variables - all serializable
...
for (JavaRDD<String> rdd : rddList) {
...
sc.foreach(myFunc
Excuse me - the line inside the loop should read: rdd.foreach(myFunc) - not
sc.
to generate multiple RDDs for various
datasets across a single run and then run an action on it - E.g.
MyFunc myFunc = ... //It implements VoidFunction
//set some extra variables - all serializable
...
for (JavaRDD<String> rdd : rddList) {
...
sc.foreach(myFunc);
}
The problem I'm seeing is that after
Hello,
Given the following structure, is it possible to query, e.g. session[0].id ?
In general, is it possible to query Array Of Struct in json RDDs?
root
|-- createdAt: long (nullable = true)
|-- id: string (nullable = true)
|-- sessions: array (nullable = true)
|    |-- element
If you are using HiveContext, it should work in 1.1.
Thanks,
Yin
On Mon, Oct 13, 2014 at 5:08 AM, shahab shahab.mok...@gmail.com wrote:
Hello,
Given the following structure, is it possible to query, e.g. session[0].id
?
In general, is it possible to query Array Of Struct in json RDDs
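A rough sketch of what that looks like in 1.1 (file path and table name are placeholders):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val events = hiveContext.jsonFile("hdfs://.../events.json")   // or jsonRDD(...) for an existing RDD[String]
events.registerTempTable("events")
hiveContext.sql("SELECT sessions[0].id FROM events").collect()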
Hi Spark!
I don't quite understand the semantics of RDDs in a streaming context
very well yet.
Are there any examples of how to implement CustomInputDStreams, with
corresponding Receivers, in the docs?
I've hacked together a custom stream, which is being opened and is
consuming data
Hi,
Is there a good way to materialize derivative RDDs from, say, a HadoopRDD
while reading in the data only once? One way to do so would be to cache
the HadoopRDD and then create derivative RDDs, but that would require
enough RAM to cache the HadoopRDD, which is not an option in my case.
Thanks
. But maybe
there is expensive work that happens in between reading the raw data
and re-using results, so it's still a win.
There's no equivalent of MultipleOutputs.
On Thu, Oct 9, 2014 at 10:55 PM, Akshat Aranya aara...@gmail.com wrote:
Hi,
Is there a good way to materialize derivate RDDs from say
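One option the thread does not spell out, but which fits the "not enough RAM" constraint, is a disk-only persist of the parent RDD; the transformations below are hypothetical:
import org.apache.spark.storage.StorageLevel
val raw = sc.textFile("hdfs://.../input")       // stand-in for the HadoopRDD
raw.persist(StorageLevel.DISK_ONLY)             // spill to local disk instead of keeping it in RAM
val derived1 = raw.map(parseA _)                // hypothetical derivative RDDs
val derived2 = raw.filter(isInteresting _)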
Spark
such as, can it handle 100k or even 10 million stages? Can this clever hacky
strategy get around the limitation of only managing RDDs from the driver?
Can I iterate over permutations (as with nesting) of an RDD set without
calling cartesian() and having memory explosion?
I've been using Spark
Hi,
I want to make the following changes in the RDD (create new RDD from the
existing to reflect some transformation):
In an RDD of key-value pairs, I want to get the keys for which the values
are 1.
How to do this using map()?
Thank You
You don't. That's what filter or the partial function version of collect
are for:
val transformedRDD = yourRDD.collect { case (k, v) if v == 1 => k }
On Wed, Sep 17, 2014 at 3:24 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I want to make the following changes in the RDD (create new
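The filter version mentioned in the same sentence would look roughly like this (again selecting the keys whose value is 1):
val keysWithOne = yourRDD.filter { case (_, v) => v == 1 }.keys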
Hi,
We all know that RDDs are immutable.
There are not enough operations that can achieve anything and everything on
RDDs.
Take for example this:
I want an Array of Bytes filled with zeros which during the program should
change. Some elements of that Array should change to 1.
If I make an RDD
know that RDDs are immutable.
There are not enough operations that can achieve anything and everything
on RDDs.
Take for example this:
I want an Array of Bytes filled with zeros which during the program should
change. Some elements of that Array should change to 1.
If I make an RDD with all
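Since RDDs are immutable, the change has to be expressed as a new RDD derived from the old one. A rough sketch of the zeros-to-ones example (the size and the flipped indices are invented):
val n = 1000
val flags = sc.parallelize(0 until n).map(i => (i, 0.toByte))    // every position starts at 0
val toFlip = Set(3, 42, 99)                                      // hypothetical positions to set to 1
val updated = flags.map { case (i, b) => if (toFlip(i)) (i, 1.toByte) else (i, b) }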
filed jira SPARK-3489 https://issues.apache.org/jira/browse/SPARK-3489
On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
I sent an email announcing
https://github.com/AyasdiOpenSource/df
This dataframe is basically a map of RDDs of columns (along with DSL
I see that the Tachyon URL constructed for an RDD partition has the executor ID
in it. So if the same partition is being processed by a different executor
on a re-execution of the same computation, it cannot really use the earlier
result. Is this a correct assessment? Will removing the executor ID from
Your question is a bit confusing.
I assume you have an RDD containing nodes and some metadata (child nodes,
maybe), and you are trying to attach another piece of metadata to it (a byte
array). If it's just the same byte array for all nodes, you can generate an RDD with
the count of nodes and zip the two RDDs together; you can
Hi,
I have an input file in which each line consists of src_node dest_node.
I have created an RDD consisting of key-value pairs where the key is the node
ID and the values are the children of that node.
Now I want to associate a byte with each node. For that I have created a
byte array.
Every time I print out the
: RE: RDDs
Thank you Raymond and Tobias.
Yeah, I am very clear about what I was asking. I was talking about replicated
rdd only. Now that I've got my understanding about job and application
validated, I wanted to know if we can replicate an rdd and run two jobs (that
need same rdd
-- Forwarded message --
From: rapelly kartheek kartheek.m...@gmail.com
Date: Thu, Sep 4, 2014 at 11:49 AM
Subject: Re: RDDs
To: Liu, Raymond raymond@intel.com
Thank you Raymond.
I am more clear now. So, if an rdd is replicated over multiple nodes (i.e.
say two sets of nodes
: Thursday, September 04, 2014 1:24 PM
To: u...@spark.incubator.apache.org
Subject: RE: RDDs
Thank you Raymond and Tobias.
Yeah, I am very clear about what I was asking. I was talking about
replicated rdd only. Now that I've got my understanding about job and
application validated, I wanted
Thank you yuanbosoft.
Folks,
I sent an email announcing
https://github.com/AyasdiOpenSource/df
This dataframe is basically a map of RDDs of columns (along with DSL sugar),
as column-based operations seem to be most common. But row operations are
not uncommon. To get rows out of columns, right now I zip the column RDDs
Hi,
Can someone tell me what kind of operations can be performed on a
replicated RDD? What are the use cases of a replicated RDD?
One basic doubt that has been bothering me for a long time: what is the difference
between an application and a job in Spark parlance? I am confused because
of Hadoop
-guide.html#resilient-distributed-datasets-rdds
as an introduction, it lists a lot of the transformations and output
operations you can use.
Personally, I also found it quite helpful to read the paper about RDDs:
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
One basic doubt
:03 PM
To: user@spark.apache.org
Subject: RDDs
Hi,
Can someone tell me what kind of operations can be performed on a replicated
rdd?? What are the use-cases of a replicated rdd.
One basic doubt that is bothering me from long time: what is the difference
between an application and job in the Spark
in parallel?
-Karthk
Hi Folks,
I’d like to find out tips on how to convert the RDDs inside a Spark Streaming
DStream to a set of SchemaRDDs.
My DStream contains JSON data pushed over from Kafka, and I’d like to use
SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as
a table, and perform
like to find out tips on how to convert the RDDs inside a Spark
Streaming DStream to a set of SchemaRDDs.
My DStream contains JSON data pushed over from Kafka, and I’d like to use
SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset
as a table, and perform queries
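A rough sketch of that flow with the 1.1 API (the stream and table names are assumptions):
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
jsonStream.foreachRDD { rdd =>                       // jsonStream: DStream[String] of the Kafka payloads
  if (rdd.take(1).nonEmpty) {                        // jsonRDD cannot infer a schema from an empty RDD
    val schemaRdd = sqlContext.jsonRDD(rdd)
    schemaRdd.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()
  }
}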
Hi
I have a three-node Spark cluster. I restricted the resources per
application by setting appropriate parameters and I could run two
applications simultaneously. Now, I want to replicate an RDD and run two
applications simultaneously. Can someone help how to go about doing this!!!
I replicated
mode. Previously my application was working well (several
RDDs, the largest being around 50G).
When I started processing larger amounts of data (RDDs of 100G) my app
is losing executors. I'm currently just loading them from a database,
repartitioning and persisting to disk (with replication x2).
I
println(parts(0)) does not solve the problem. It does not work
On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote:
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
When I add
parts(0).collect().foreach(println)
to the
discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Out-of-memory-on-large-RDDs-tp2533p2534.html
the SparkPageRank code and want to see the
intermediate steps, like the RDDs formed in the intermediate steps.
Here is a part of the code along with the lines that I added in order to
print the RDDs.
I want to print the *parts* in the code (denoted by the comment in
Bold letters). But, when I try to do
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote:
When I add
parts(0).collect().foreach(println)
parts(1).collect().foreach(println), for printing parts, I get the following
error
not enough arguments for method collect: (pf:
PartialFunction[Char,B])(implicit
Hi,
I was going through the SparkPageRank code and want to see the intermediate
steps, like the RDDs formed in the intermediate steps.
Here is a part of the code along with the lines that I added in order to
print the RDDs.
I want to print the *parts* in the code (denoted by the comment in Bold
Hi,
What kind of error do you receive?
Best regards,
Jörn
On 24 Aug 2014 08:29, Deep Pradhan pradhandeep1...@gmail.com wrote:
Hi,
I was going through the SparkPageRank code and want to see the
intermediate steps, like the RDDs formed in the intermediate steps.
Here is a part
Hi all,
I have a Spark cluster of 30 machines, 16GB / 8 cores on each, running in
standalone mode. Previously my application was working well (several
RDDs, the largest being around 50G).
When I started processing larger amounts of data (RDDs of 100G) my app
is losing executors. I'm currently
Hi All.
I need to create a lot of RDDs starting from a set of roots and count the
rows in each. Something like this:
final JavaSparkContext sc = new JavaSparkContext(conf);
List<String> roots = ...
Map<String, Object> res = sc.parallelize(roots).mapToPair(new
PairFunction<String, String, Long>
You won't be able to use RDDs inside of an RDD operation. I imagine your
immediate problem is that the code you've elided references 'sc' and
that gets referenced by the PairFunction and serialized, but it can't
be.
If you want to play it this way, parallelize across roots in Java.
That is just use
PM, Sean Owen so...@cloudera.com wrote:
You won't be able to use RDDs inside of an RDD operation. I imagine your
immediate problem is that the code you've elided references 'sc' and
that gets referenced by the PairFunction and serialized, but it can't
be.
If you want to play it this way
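In Scala, the "parallelize across roots from the driver" idea looks roughly like this (the paths are placeholders; the original question was Java):
val roots = Seq("hdfs://.../a", "hdfs://.../b")
val counts: Map[String, Long] =
  roots.map(root => root -> sc.textFile(root).count()).toMap   // each count() is a separate job run from the driver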
number, GraphX jobs will throw:
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of
partitions
So my quick fix is to repartition the EdgeRDD to exactly the number of
parallelism. But I think this would lead to much network communication.
So is there any other better
Hi All,
I met the titled error. This exception occurred in line 223, as shown below:
212 // read files
213 val lines =
sc.textFile(path_edges).map(line => line.split(",")).map(line => ((line(0),
line(1)), line(2).toDouble)).reduceByKey(_ +
_).cache
214
215 val
across RDDs and have Spark execute them efficiently.
Our use case for a feature like this is processing many records and
attaching metadata to the records during processing about our confidence in
the data-points, and then writing the data to one spot and the metadata to
another spot.
I've also wanted
RDDs are touched in every batch. Since Spark
Streaming is not really a dedicated data store, it's not really designed to
separate out hot data and cold data.
2. For each key, in the state you could maintain a timestamp of when it was
updated and accordingly return None to filter that state out
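A rough sketch of suggestion 2 with updateStateByKey (the value type, the age threshold and the stream name are assumptions):
val maxAgeMs = 60 * 60 * 1000L                                   // hypothetical: drop keys idle for an hour
def update(values: Seq[String], state: Option[(String, Long)]): Option[(String, Long)] = {
  val now = System.currentTimeMillis()
  if (values.nonEmpty) Some((values.last, now))                  // new data: refresh the timestamp
  else state.filter { case (_, ts) => now - ts < maxAgeMs }      // stale entries return None and are dropped
}
val stateStream = pairStream.updateStateByKey(update _)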
. Additionally, given that my data is
partitionable by datetime, does it make sense to have a custom
datetime partitioner, and just persist the dstream to disk, to ensure
that its RDDs are only pulled off of disk (into memory) occasionally?
What's the cost of having a bunch of relatively large, stateful RDDs
Has anyone reported issues using SparkSQL with sequence files (all of our
data is in this format within HDFS)? We are considering whether to burn
the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
point for us.
I haven't heard any reports of this yet, but I don't see any reason why it
wouldn't work. You'll need to manually convert the objects that come out of
the sequence file into something where SparkSQL can detect the schema (i.e.
scala case classes or java beans) before you can register the RDD as a
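A minimal sketch of that conversion, assuming Text keys and LongWritable values (your Writable types and field names will differ):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SQLContext
case class Record(key: String, value: Long)
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                           // implicit RDD[case class] -> SchemaRDD
val records = sc.sequenceFile("hdfs://.../data", classOf[Text], classOf[LongWritable])
  .map { case (k, v) => Record(k.toString, v.get) }         // pull plain values out of the Writables
records.registerAsTable("records")                          // renamed registerTempTable in 1.1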
: Mon, 7 Jul 2014 17:12:42 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org
I haven't heard any reports of this yet, but I don't see any reason why it
wouldn't work. You'll need to manually convert the objects that come out of the
sequence file into something where
We know Scala 2.11 has removed the limitation on the number of parameters, but
Spark 1.0 is not compatible with it. So now we are considering using Java
beans instead of Scala case classes.
You can also manually create a class that implements Scala's Product
interface. Finally, SPARK-2179
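A minimal sketch of such a hand-rolled Product class (field names invented; a case class generates the same members automatically):
class WideRow(val a: String, val b: Int) extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRow]
  def productArity: Int = 2
  def productElement(n: Int): Any = n match {
    case 0 => a
    case 1 => b
  }
}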
...@databricks.com
Date: Mon, 7 Jul 2014 17:52:34 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org
We know Scala 2.11 has removed the limitation on the number of parameters, but Spark 1.0
is not compatible with it. So now we are considering using Java beans instead of
Scala case
--
From: mich...@databricks.com
Date: Mon, 7 Jul 2014 17:52:34 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org
We know Scala 2.11 has removed the limitation on the number of parameters, but
Spark 1.0 is not compatible with it. So now we
A lot of the RDDs that you create in code may not even be constructed, as the
task layer is optimized by the DAG scheduler. The closest is onUnpersistRDD
in SparkListener.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon,
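A small sketch of hooking that event (assumes an existing SparkContext sc):
import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}
sc.addSparkListener(new SparkListener {
  override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit =
    println(s"RDD ${event.rddId} was unpersisted")
})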
Hi all,
I am trying to create a custom RDD class for result set of queries
supported in InMobi Grill (http://inmobi.github.io/grill/)
Each result set has a schema (similar to Hive's TableSchema) and a path in
HDFS containing the result set data.
An easy way of doing this would be to create a
Yep exactly! I’m not sure how complicated it would be to pull off. If someone
wouldn’t mind helping to get me pointed in the right direction I would be happy
to look into and contribute this functionality. I imagine this would be
implemented in the scheduler codebase and there would be some
This would be really useful. Especially for Shark, where a shift of
partitioning affects all subsequent queries unless task scheduling time
beats spark.locality.wait. This can cause overall low performance for all
subsequent tasks.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
We have a use case where we’d like something to execute once on each node and I
thought it would be good to ask here.
Currently we achieve this by setting the parallelism to the number of nodes and
use a mod partitioner:
val balancedRdd = sc.parallelize(
(0 until Settings.parallelism)
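The snippet above is cut off; a rough reconstruction of the idea (it assumes Settings.parallelism equals the node count, and note Spark does not strictly guarantee one task per node):
import org.apache.spark.HashPartitioner
val balancedRdd = sc.parallelize(0 until Settings.parallelism, Settings.parallelism)
  .map(i => (i, ()))
  .partitionBy(new HashPartitioner(Settings.parallelism))   // key i lands in partition i % parallelism = i
balancedRdd.foreachPartition { _ =>
  // exactly one task per partition; with parallelism == node count, ideally one per node
}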
-contractor.com wrote:
Thanks Krishna. Seems like you have to use Avro and then convert that
to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll
look into this some more.
Thanks,
Mahesh
From: Krishna Sankar ksanka...@gmail.com
Reply-To: user@spark.apache.org user
RDDs to Parquet records
I have a similar case where I have an RDD[(List[Any], List[Long])] and want to save
it as a Parquet file.
My understanding is that only an RDD of case classes can be converted to a
SchemaRDD. So is there any way I can save this RDD as a Parquet file without
using Avro?
Thanks
Tailor [via Apache Spark User List]
Date: Thursday, June 19, 2014 at 12:53 PM
To: Mahesh Padmanabhan
Subject: Re: Spark streaming RDDs to Parquet records
I have
Hello,
Is there an easy way to convert RDDs within a DStream into Parquet records?
Here is some incomplete pseudo code:
// Create streaming context
val ssc = new StreamingContext(...)
// Obtain a DStream of events
val ds = KafkaUtils.createStream(...)
// Get Spark context to get to the SQL
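Continuing the pseudo code, a rough sketch of the remaining steps (jsonRDD arrived with Spark 1.1; paths and names are placeholders):
val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
ds.map(_._2)                                           // KafkaUtils.createStream yields (key, message) pairs
  .foreachRDD { (rdd, time) =>
    if (rdd.take(1).nonEmpty) {
      val schemaRdd = sqlContext.jsonRDD(rdd)
      schemaRdd.saveAsParquetFile(s"hdfs://.../events-${time.milliseconds}")   // one Parquet directory per batch
    }
  }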
, maheshtwc
mahesh.padmanab...@twc-contractor.com wrote:
Hello,
Is there an easy way to convert RDDs within a DStream into Parquet records?
Here is some incomplete pseudo code:
// Create streaming context
val ssc = new StreamingContext(...)
// Obtain a DStream of events
val ds
Thanks Krishna. Seems like you have to use Avro and then convert that to
Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into
this some more.
Thanks,
Mahesh
From: Krishna Sankar ksanka...@gmail.com
Reply-To: user@spark.apache.org
to use Avro and then convert that to
Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look
into this some more.
Thanks,
Mahesh
From: Krishna Sankar ksanka...@gmail.com
Reply-To: user@spark.apache.org user@spark.apache.org
Date: Tuesday, June 17, 2014 at 2:41 PM
Hi,
How do I check the rdds that I have persisted? I have some code that looks
like:
rd1.cache()
rd2.cache()
...
rdN.cache()
How can I unpersist all rdd's at once? And is it possible to get the names
of the rdd's that are currently persisted (list = rd1, rd2, ..., rdN)?
Thank you
Check out SparkContext.getPersistentRDDs!
On Fri, Jun 13, 2014 at 1:06 PM, mrm ma...@skimlinks.com wrote:
Hi,
How do I check the rdds that I have persisted? I have some code that looks
like:
rd1.cache()
rd2.cache()
...
rdN.cache()
How can I unpersist all rdd's at once
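Building on that pointer, a small Scala sketch (the original question was PySpark, but the idea carries over):
// list what is currently cached, then unpersist everything
sc.getPersistentRDDs.foreach { case (id, rdd) => println(s"RDD $id: $rdd") }
sc.getPersistentRDDs.values.foreach(_.unpersist())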
appreciate it if you could help me with this, I have tried different
ways and googling it! I suspect it might be a silly error but I can't figure
it out.
Maria
Hi Nick,
Thank you for the reply, I forgot to mention I was using pyspark in my first
message.
Maria
Hi,
I'm wondering whether it's possible to continuously merge the RDDs coming
from a stream into a single RDD efficiently.
One thought is to use the union() method. But using union, I will get a new
RDD each time I do a merge. I don't know how I should name these RDDs,
because I remember Spark
to the RDD data, so potentially the easiest way to solve the problem at
hand is to create several RDDs from the original RDD.
The issue I see is that the 'sc.makeRDD(v.toSeq)' will potentially blow
when trying to materialize the iterator into a seq. I also don't know what
the behaviour of that call
the
problem at hand is to create several RDDs from the original RDD.
The issue I see is that the 'sc.makeRDD(v.toSeq)' will potentially blow
when trying to materialize the iterator into a seq. I also don't know what
the behaviour of that call to SparkContext will be on a remote worker.
My
The RDD API has functions to join multiple RDDs, such as PairRDD.join
or PairRDD.cogroup, that take another RDD as input, e.g.
firstRDD.join(secondRDD)
I'm looking for ways to do the opposite: split an existing RDD. What is the
right way to create derivative RDDs from an existing RDD?
e.g
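There is no single split operator; the common pattern is to cache the parent and derive each child with filter. A minimal sketch (the predicate is made up):
val parent = firstRDD.cache()                      // avoid recomputing the parent for every child
val evens = parent.filter(x => x.hashCode % 2 == 0)
val odds  = parent.filter(x => x.hashCode % 2 != 0)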