Iterating on RDDs

2015-02-26 Thread Vijayasarathy Kannan
Hi, I have the following use case. (1) I have an RDD of edges of a graph (say R). (2) do a groupBy on R (by say source vertex) and call a function F on each group. (3) collect the results from Fs and do some computation (4) repeat the above steps until some criterion is met In (2), the groups

Re: Iterating on RDDs

2015-02-26 Thread Imran Rashid
val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER) // or whatever persistence makes more sense for you ... while(true) { val res = grouped.flatMap(F) res.collect.foreach(func) if(criteria) break } On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan
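A runnable Scala sketch of the loop Imran outlines; R, G, F, func and the stopping criterion are placeholders standing in for the original poster's logic.

    import org.apache.spark.storage.StorageLevel

    // Group once, persist the grouped RDD, then iterate until the stopping
    // criterion (checked on the driver) is satisfied.
    val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER)
    var done = false
    while (!done) {
      val res = grouped.flatMap(F)   // F: one group => zero or more results (placeholder)
      res.collect().foreach(func)    // func: driver-side handling of each result (placeholder)
      done = criteria()              // criteria(): Boolean (placeholder)
    }
    grouped.unpersist()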

Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
I have this function in the driver program which collects the result from rdds (in a stream) into an array and return. However, even though the RDDs (in the dstream) have data, the function is returning an empty array...What am I doing wrong? I can print the RDD values inside the foreachRDD call

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tobias Pfeiffer
Hi, On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore thanigai.vell...@gmail.com wrote: It appears that the function immediately returns even before the foreachrdd stage is executed. Is that possible? Sure, that's exactly what happens. foreachRDD() schedules a computation, it does not
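A minimal sketch of why the array comes back empty: foreachRDD only registers the closure, so nothing has run by the time the enclosing function returns. The names (lines, results) are illustrative.

    val results = scala.collection.mutable.ArrayBuffer[String]()

    // This only *schedules* work for each future batch; it does not execute now.
    lines.foreachRDD { rdd =>
      results ++= rdd.collect()   // runs later, on the driver, once per batch
    }

    // At this point no batch has been processed yet, so `results` is still empty.
    // Consume the data inside foreachRDD (or another output operation) instead of
    // returning it from the function that sets up the stream.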

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
thanigai.vell...@gmail.com wrote: I have this function in the driver program which collects the result from rdds (in a stream) into an array and return. However, even though the RDDs (in the dstream) have data, the function is returning an empty array...What am I doing wrong? I can print

Re: Executors dropping all memory stored RDDs?

2015-02-24 Thread Thomas Gerber
of disk. So, in case someone else notices a behavior like this, make sure you check your cluster monitor (like ganglia). On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am storing RDDs with the MEMORY_ONLY_SER Storage Level, during the run of a big job

Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi, I'm writing a Spark program where I want to divide an RDD into different groups, but the groups are too big to use groupByKey. To cope with that, since I know in advance the list of keys for each group, I build a map from the keys to the RDDs that result from filtering the input RDD to get

Creating RDDs from within foreachPartition() [Spark-Streaming]

2015-02-18 Thread t1ny
Hi all, I am trying to create RDDs from within /rdd.foreachPartition()/ so I can save these RDDs to ElasticSearch on the fly : stream.foreachRDD(rdd = { rdd.foreachPartition { iterator = { val sc = rdd.context iterator.foreach { case (cid

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi Sean, Thanks a lot for your answer. That explains it, as I was creating thousands of RDDs, so I guess the communication overhead was the reason why the Spark job was freezing. After changing the code to use RDDs of pairs and aggregateByKey it works just fine, and quite fast. Again, thanks

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Sean Owen
At some level, enough RDDs create hundreds of thousands of tiny partitions of data, each of which creates a task for each stage. The raw overhead of all the message passing can slow things down a lot. I would not design something to use an RDD per key. You would generally use key by some value you
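A sketch of the keyed alternative Sean suggests (and that Juan reports worked): keep one pair RDD and aggregate per key, e.g. with aggregateByKey, instead of building one RDD per key. The types are illustrative.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Per-key (count, sum) in a single pass over one RDD -- no per-key RDDs needed.
    def perKeyStats(data: RDD[(String, Double)]): RDD[(String, (Long, Double))] =
      data.aggregateByKey((0L, 0.0))(
        (acc, v) => (acc._1 + 1L, acc._2 + v),     // fold a value into the accumulator
        (a, b)   => (a._1 + b._1, a._2 + b._2)     // merge accumulators across partitions
      )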

Re: Creating RDDs from within foreachPartition() [Spark-Streaming]

2015-02-18 Thread Sean Owen
You can't use RDDs inside RDDs. RDDs are managed from the driver, and functions like foreachRDD execute things on the remote executors. You can write code to simply directly save whatever you want to ES. There is not necessarily a need to use RDDs for that. On Wed, Feb 18, 2015 at 11:36 AM, t1ny
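For the original use case, a hedged sketch using the elasticsearch-hadoop connector (assuming it is on the classpath; the index name and record shape are made up): the batch RDD is written out directly, with no RDDs created on the executors.

    import org.elasticsearch.spark._   // adds saveToEs to RDDs (elasticsearch-hadoop)

    stream.foreachRDD { rdd =>
      rdd.map { case (cid, payload) => Map("cid" -> cid, "payload" -> payload) }
         .saveToEs("events/event")     // hypothetical index/type
    }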

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi Paweł, Thanks a lot for your answer. I finally got the program to work by using aggregateByKey, but I was wondering why creating thousands of RDDs doesn't work. I think that could be interesting for using methods that work on RDDs like for example JavaDoubleRDD.stats() ( http

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Paweł Szulc
juan.rodriguez.hort...@gmail.com wrote: Hi, I'm writing a Spark program where I want to divide a RDD into different groups, but the groups are too big to use groupByKey. To cope with that, since I know in advance the list of keys for each group, I build a map from the keys to the RDDs that result from

Re: Shuffle on joining two RDDs

2015-02-16 Thread Davies Liu
by union-ing the two transformed rdds, which is very different from the way it works under the hood in scala to enable narrow dependencies. It really needs a bigger change to pyspark. I filed this issue: https://issues.apache.org/jira/browse/SPARK-5785 (and the somewhat related issue about

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
at rdd.dependencies. its a little tricky, though, you need to look deep into the lineage (example at the end). Sean -- yes it does require both RDDs have the same partitioner, but that should happen naturally if you just specify the same number of partitions, you'll get equal HashPartitioners

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
are hidden deeper inside and are not user-visible. But I think a better way (in scala, anyway) is to look at rdd.dependencies. its a little tricky, though, you need to look deep into the lineage (example at the end). Sean -- yes it does require both RDDs have the same partitioner, but that should

Columnar-Oriented RDDs

2015-02-13 Thread Night Wolf
Hi all, I'd like to build/use column oriented RDDs in some of my Spark code. A normal Spark RDD is stored as a row-oriented object, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark (used to) and SparkSQL use a columnar storage format using

Re: Shuffle on joining two RDDs

2015-02-13 Thread Imran Rashid
yeah I thought the same thing at first too, I suggested something equivalent w/ preservesPartitioning = true, but that isn't enough. the join is done by union-ing the two transformed rdds, which is very different from the way it works under the hood in scala to enable narrow dependencies

Re: Columnar-Oriented RDDs

2015-02-13 Thread Michael Armbrust
to build/use column oriented RDDs in some of my Spark code. A normal Spark RDD is stored as a row-oriented object, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark (used to) and SparkSQL use a columnar storage format using primitive

Re: Columnar-Oriented RDDs

2015-02-13 Thread Koert Kuipers
all, I'd like to build/use column oriented RDDs in some of my Spark code. A normal Spark RDD is stored as a row-oriented object, if I understand correctly. I'd like to leverage some of the advantages of a columnar memory format. Shark (used to) and SparkSQL use a columnar storage format using

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs. On 2015-02-12 17:12, Sean Owen wrote: Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote: Hi Karlson, I think your

Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi All, using Pyspark, I create two RDDs (one with about 2M records (~200MB), the other with about 8M records (~2GB)) of the format (key, value). I've done a partitionBy(num_partitions) on both RDDs and verified that both RDDs have the same number of partitions and that equal keys reside

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
, anyway) is to look at rdd.dependencies. its a little tricky, though, you need to look deep into the lineage (example at the end). Sean -- yes it does require both RDDs have the same partitioner, but that should happen naturally if you just specify the same number of partitions, you'll get
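A small sketch of the lineage check Imran describes: walk rdd.dependencies recursively, since the ShuffleDependency (if any) can sit a few RDDs below the one returned by join().

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{Dependency, ShuffleDependency}

    def hasShuffle(rdd: RDD[_]): Boolean =
      rdd.dependencies.exists {
        case _: ShuffleDependency[_, _, _] => true                 // a shuffle in this branch
        case dep: Dependency[_]            => hasShuffle(dep.rdd)  // keep walking the lineage
      }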

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Hi Karlson, I think your assumptions are correct -- that join alone shouldn't require any shuffling. But it's possible you are getting tripped up by lazy evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized and cached somewhere? e.g., if you just did
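A Scala sketch of that point: partition both sides with the same partitioner and materialize them before joining. rddA and rddB are assumed pair RDDs; the partition count and storage level are arbitrary.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    val part = new HashPartitioner(100)
    val a = rddA.partitionBy(part).persist(StorageLevel.MEMORY_ONLY)
    val b = rddB.partitionBy(part).persist(StorageLevel.MEMORY_ONLY)
    a.count(); b.count()           // force the partitioning to actually happen

    val joined = a.join(b)         // co-partitioned inputs => no additional shuffle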

Re: Shuffle on joining two RDDs

2015-02-12 Thread Sean Owen
Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote: Hi Karlson, I think your assumptions are correct -- that join alone shouldn't require any shuffling. But its possible you are getting tripped up by lazy

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
, anyway) is to look at rdd.dependencies. its a little tricky, though, you need to look deep into the lineage (example at the end). Sean -- yes it does require both RDDs have the same partitioner, but that should happen naturally if you just specify the same number of partitions, you'll get equal

Re: Zipping RDDs of equal size not possible

2015-02-05 Thread Niklas Wilcke
) } On 10.01.2015 06:56, Xiangrui Meng wrote: sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke 1wil

New combination-like RDD based on two RDDs

2015-02-04 Thread dash
Hey Spark gurus! Sorry for the confusing title. I do not know exactly how to describe my problem; if you know, please tell me so I can change it :-) Say I have two RDDs right now, and they are val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6 val rdd2 = sc.parallelize(List((2,(1)), (2

Re: New combination-like RDD based on two RDDs

2015-02-04 Thread dash
(println) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-combination-like-RDD-based-on-two-RDDs-tp21508p21511.html

Joining piped RDDs

2015-02-04 Thread Pavel Velikhov
solution we have is to format the data into the pipe, so it includes the ids, and then restore it all in a map after the pipe. However it would be much nicer if we could just join/zip back the output of the pipe. However we can’t cache the RDDs, so it would be nice to have a forkRDD of some

Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Hemanth Yamijala
Hi, I am trying to find where Spark persists RDDs when we call the persist() api and executed under YARN. This is purely for understanding... In my driver program, I wait indefinitely, so as to avoid any clean up problems. In the actual job, I roughly do the following: JavaRDD<String> lines

Re: Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Sean Owen
RDDs when we call the persist() api and executed under YARN. This is purely for understanding... In my driver program, I wait indefinitely, so as to avoid any clean up problems. In the actual job, I roughly do the following: JavaRDD<String> lines = context.textFile(args[0]); lines.persist

Re: Zipping RDDs of equal size not possible

2015-01-09 Thread Xiangrui Meng
sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: Hi Spark community, I have

Zipping RDDs of equal size not possible

2015-01-09 Thread Niklas Wilcke
Hi Spark community, I have a problem with zipping two RDDs of the same size and same number of partitions. The error message says that zipping is only allowed on RDDs which are partitioned into chunks of exactly the same sizes. How can I assure this? My workaround at the moment is to repartition
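One workaround sketch (not from the thread itself): index both RDDs and join on the index, which tolerates unequal per-partition sizes at the cost of a shuffle.

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Zip two RDDs of the same total length even if their partition sizes differ.
    def zipByIndex[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] =
      left.zipWithIndex.map(_.swap)
        .join(right.zipWithIndex.map(_.swap))
        .values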

Re: Join RDDs with DStreams

2015-01-08 Thread Akhil Das
to join non-DStream RDDs with DStream RDDs? Here is the use case. I have a lookup table stored in HDFS that I want to read as an RDD. Then I want to join it with the RDDs that are coming in through the DStream. How can I do this? Thanks. Asim

Join RDDs with DStreams

2015-01-08 Thread Asim Jalis
Is there a way to join non-DStream RDDs with DStream RDDs? Here is the use case. I have a lookup table stored in HDFS that I want to read as an RDD. Then I want to join it with the RDDs that are coming in through the DStream. How can I do this? Thanks. Asim
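The usual pattern here (a sketch, not quoted from the replies) is DStream.transform, joining each batch RDD against a cached lookup RDD; the HDFS path, record layout and the name pairDStream are assumptions.

    import org.apache.spark.SparkContext._

    // Static lookup table, keyed and cached once on startup.
    val lookup = sc.textFile("hdfs:///data/lookup.csv")
      .map { line => val Array(k, v) = line.split(","); (k, v) }   // assumed CSV layout
      .cache()

    // Join it against every batch of a keyed DStream.
    val joined = pairDStream.transform(batchRdd => batchRdd.join(lookup))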

Re: Join RDDs with DStreams

2015-01-08 Thread Gerard Maas
: Is there a way to join non-DStream RDDs with DStream RDDs? Here is the use case. I have a lookup table stored in HDFS that I want to read as an RDD. Then I want to join it with the RDDs that are coming in through the DStream. How can I do this? Thanks. Asim

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Raghavendra Pandey
You can also use the join function of rdd. This is actually a kind of append function that adds up all the rdds and creates one uber rdd. On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote: Thank you for the response, sure will try that out. Currently I changed my code such that the first

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread rkgurram
Thank you for the response, sure will try that out. Currently I changed my code such that the first map files.map to files.flatMap, which I guess will do something similar to what you are saying, it gives me a List[] of elements (in this case LabeledPoints, I could also do RDDs) which I then turned

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Sean Owen
function that adds up all the rdds and creates one uber rdd. On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote: Thank you for the response, sure will try that out. Currently I changed my code such that the first map files.map to files.flatMap, which I guess will do something similar to what you

Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together based on some predefined business cases. After updating to 1.2.0, some of the concurrency expectations about how the stages within jobs are executed

Re: Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
I asked this question too soon. I am caching off a bunch of RDDs in a TrieMap so that our framework can wire them together and the locking was not completely correct- therefore it was creating multiple new RDDs at times instead of using cached versions- which were creating completely separate

Re: Cannot see RDDs in Spark UI

2015-01-06 Thread Andrew Ash
Hi Manoj, I've noticed that the storage tab only shows RDDs that have been cached. Did you call .cache() or .persist() on any of the RDDs? Andrew On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, I create a bunch of RDDs, including schema RDDs. When I run

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
an RDD cannot contain elements of type RDD. (i.e. you can't nest RDDs within RDDs, in fact, I don't think it makes any sense) I suggest rather than having an RDD of file names, collect those file name strings back on to the driver as a Scala array of file names, and then from there, make an array
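A sketch of that suggestion, assuming the file names fit comfortably on the driver and sc is the SparkContext (fileNameRDD is the RDD of file names from the original question).

    // Bring the (small) list of file names back to the driver...
    val fileNames: Array[String] = fileNameRDD.collect()

    // ...build one RDD per file there, then combine them into a single RDD.
    val perFile  = fileNames.map(path => sc.textFile(path))
    val combined = sc.union(perFile)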

Cannot see RDDs in Spark UI

2015-01-06 Thread Manoj Samel
Hi, I create a bunch of RDDs, including schema RDDs. When I run the program and go to UI on xxx:4040, the storage tab does not show any RDDs. Spark version is 1.1.1 (Hadoop 2.3) Any thoughts? Thanks,

Exception after changing RDDs

2014-12-23 Thread kai.lu
$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-after-changing-RDDs-tp20841.html

Re: UNION two RDDs

2014-12-22 Thread Jerry Lam
guarantees about ordering. In practice, you may find that you still encounter the elements in the same order after coalesce(1), although I am not sure that is even true. union() is the same story; unless the RDDs are sorted I don't think there are guarantees. However I'm almost certain

Re: UNION two RDDs

2014-12-19 Thread Sean Owen
() is the same story; unless the RDDs are sorted I don't think there are guarantees. However I'm almost certain that in practice, as it happens now, A's elements would come before B's after a union, if you did traverse them. On Fri, Dec 19, 2014 at 5:41 AM, madhu phatak phatak@gmail.com wrote: Hi

spark-shell bug with RDDs and case classes?

2014-12-19 Thread Jay Hutfles
. Is this something I just have to live with when using the REPL? Or is this covered by something bigger that's being addressed? Thanks in advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-bug-with-RDDs-and-case-classes-tp20789.html

Re: spark-shell bug with RDDs and case classes?

2014-12-19 Thread Sean Owen
just have to live with when using the REPL? Or is this covered by something bigger that's being addressed? Thanks in advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-bug-with-RDDs-and-case-classes-tp20789.html

UNION two RDDs

2014-12-18 Thread Jerry Lam
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry

Re: RDDs being cleaned too fast

2014-12-16 Thread Harihar Nahak
RDD.persist() can be useful here. On 11 December 2014 at 14:34, ankits [via Apache Spark User List] ml-node+s1001560n20613...@n3.nabble.com wrote: I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can i inspect the size of RDD in memory and get more information

Re: Locking for shared RDDs

2014-12-11 Thread Tathagata Das
Aditya, I think you have the mental model of Spark Streaming a little off the mark. Unlike traditional streaming systems, where any kind of state is mutable, Spark Streaming is designed on Spark's immutable RDDs. Streaming data is received and divided into immutable blocks, which then form immutable RDDs

Re: RDDs being cleaned too fast

2014-12-11 Thread Ranga
I was having similar issues with my persistent RDDs. After some digging around, I noticed that the partitions were not balanced evenly across the available nodes. After a repartition, the RDD was spread evenly across all available memory. Not sure if that is something that would help your use-case

RDDs being cleaned too fast

2014-12-10 Thread ankits
I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can i inspect the size of RDD in memory and get more information about why it was cleaned up. There should be more than enough memory available on the cluster to store them, and by default, the spark.cleaner.ttl

Re: RDDs being cleaned too fast

2014-12-10 Thread Aaron Davidson
The ContextCleaner uncaches RDDs that have gone out of scope on the driver. So it's possible that the given RDD is no longer reachable in your program's control flow, or else it'd be a bug in the ContextCleaner. On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote: I'm using spark
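A minimal sketch of the implication: keep a long-lived, driver-side reference to any persisted RDD you still need, so it stays reachable and the ContextCleaner leaves it alone. The registry object here is purely illustrative.

    import org.apache.spark.rdd.RDD

    // Holding RDDs here keeps them reachable for the lifetime of the application.
    object CachedRdds {
      private val registry = scala.collection.mutable.Map.empty[String, RDD[_]]
      def put(name: String, rdd: RDD[_]): Unit = registry(name) = rdd
      def get(name: String): Option[RDD[_]] = registry.get(name)
    }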

Caching RDDs with shared memory - bug or feature?

2014-12-09 Thread insperatum
If all RDD elements within a partition contain pointers to a single shared object, Spark persists as expected when the RDD is small. However, if the RDD is more than *200 elements* then Spark reports requiring much more memory than it actually does. This becomes a problem for large RDDs, as Spark

Locking for shared RDDs

2014-12-08 Thread aditya.athalye
I am relatively new to Spark. I am planning to use Spark Streaming for my OLAP use case, but I would like to know how RDDs are shared between multiple workers. If I need to constantly compute some stats on the streaming data, presumably shared state would have to be updated serially by different

Re: Locking for shared RDDs

2014-12-08 Thread Raghavendra Pandey
relatively new to Spark. I am planning to use Spark Streaming for my OLAP use case, but I would like to know how RDDs are shared between multiple workers. If I need to constantly compute some stats on the streaming data, presumably shared state would have to be updated serially by different spark

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Corey Nolet
in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached RDD to provide

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Aaron Davidson
is calling a job. I guess application (more than one spark context) is what I'm asking about On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote: I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle

Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached
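A sketch of submitting two actions over one cached RDD from separate threads using Scala Futures; the input path and transformations are placeholders.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // One RDD loads and caches the shared data...
    val base = sc.textFile("hdfs:///data/input").map(_.length.toDouble).cache()
    base.count()   // materialize the cache before the concurrent jobs start

    // ...two jobs run concurrently against it, each submitted from its own thread.
    val f1 = Future { base.reduce(_ + _) }      // sum
    val f2 = Future { base.reduce(math.max) }   // max
    println(s"sum=${Await.result(f1, Duration.Inf)} max=${Await.result(f2, Duration.Inf)}")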

Determination of number of RDDs

2014-12-04 Thread Deep Pradhan
Hi, I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 RDDs. Is that possible in GraphX? Like in C we have arrays of pointers. Do we have arrays of RDDs in Spark? Can we create such an array

Re: Determination of number of RDDs

2014-12-04 Thread Ankur Dave
At 2014-12-04 02:08:45 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 rdds. Is that possible in GraphX? This is possible: you can collect

RE: Determination of number of RDDs

2014-12-04 Thread Kapil Malik
Regarding: Can we create such an array and then parallelize it? Parallelizing an array of RDDs - i.e. RDD[RDD[x]] is not possible. RDD is not serializable. From: Deep Pradhan [mailto:pradhandeep1...@gmail.com] Sent: 04 December 2014 15:39 To: user@spark.apache.org Subject: Determination

Re: Loading RDDs in a streaming fashion

2014-12-02 Thread Ashish Rangole
This is a common use case and this is how Hadoop APIs for reading data work, they return an Iterator [Your Record] instead of reading every record in at once. On Dec 1, 2014 9:43 PM, Andy Twigg andy.tw...@gmail.com wrote: You may be able to construct RDDs directly from an iterator - not sure

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
file = transform file into a bunch of records What does this function do exactly? Does it load the file locally? Spark supports RDDs exceeding global RAM (cf the terasort example), but if your example just loads each file locally, then this may cause problems. Instead, you should load each file

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
: file = transform file into a bunch of records What does this function do exactly? Does it load the file locally? Spark supports RDDs exceeding global RAM (cf the terasort example), but if your example just loads each file locally, then this may cause problems. Instead, you should load each file

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
Could you modify your function so that it streams through the files record by record and outputs them to hdfs, then read them all in as RDDs and take the union? That would only use bounded memory. On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote: Actually, I'm working

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
through the files record by record and outputs them to hdfs, then read them all in as RDDs and take the union? That would only use bounded memory. On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote: Actually, I'm working with a binary format. The api allows reading out a single

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
You may be able to construct RDDs directly from an iterator - not sure - you may have to subclass your own. On 1 December 2014 at 18:40, Keith Simmons ke...@pulse.io wrote: Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious

Re: RDDs join problem: incorrect result

2014-11-30 Thread Harihar Nahak
wrote: Hi, I ran into a problem when doing two RDDs join operation. For example, RDDa: RDD[(String,String)] and RDDb:RDD[(String,Int)]. Then, the result RDDc:[String,(String,Int)] = RDDa.join(RDDb). But I find the results in RDDc are incorrect compared with RDDb. What's wrong in join

RDDs join problem: incorrect result

2014-11-26 Thread liuboya
Hi, I ran into a problem when doing two RDDs join operation. For example, RDDa: RDD[(String,String)] and RDDb:RDD[(String,Int)]. Then, the result RDDc:[String,(String,Int)] = RDDa.join(RDDb). But I find the results in RDDc are incorrect compared with RDDb. What's wrong in join? -- View

How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Blind Faith
Say I have two RDDs with the following values x = [(1, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)] How can I achieve this? I know you can use outerJoin followed by map to achieve this, but is there a more direct way?

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
You want to use RDD.union (or SparkContext.union for many RDDs). These don't join on a key. Union doesn't really do anything itself, so it is low overhead. Note that the combined RDD will have all the partitions of the original RDDs, so you may want to coalesce after the union. val x
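Using the values from the question, a quick sketch (assuming a SparkContext sc):

    val x = sc.parallelize(Seq((1, 3), (2, 4)))
    val y = sc.parallelize(Seq((3, 5), (4, 7)))

    // No key matching, no shuffle -- just the partitions of both inputs together.
    val z = x.union(y)       // (1,3), (2,4), (3,5), (4,7)
    val z2 = z.coalesce(2)   // optional: trim the combined partition count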

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Harihar Nahak
from : file2.txt contains (ID, name) val y: RDD[(Long, String)]{where ID is common in both the RDDs} [(4407 ,Jhon), (2064, Maria), (7815 ,Casto), (5736,Ram), (8031,XYZ)] and I'm expecting result should be like this : [(ID, Name, Count)] [(4407 ,Jhon, 40), (2064, Maria, 38), (7815 ,Casto, 10

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Haviv
is common in both the RDDs} [(4407 ,Jhon), (2064, Maria), (7815 ,Casto), (5736,Ram), (8031,XYZ)] and I'm expecting result should be like this : [(ID, Name, Count)] [(4407 ,Jhon, 40), (2064, Maria, 38), (7815 ,Casto, 10), (5736,Ram, 17), (8031,XYZ, 3)] Any help will really

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
Harihar, your question is the opposite of what was asked. In the future, please start a new thread for new questions. You want to do a join in your case. The join function does an inner join, which I think is what you want because you stated your IDs are common in both RDDs. For other cases you

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Harihar Nahak
-RDDs-with-mutually-exclusive-keys-tp19417p19431.html

Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Dai, Kevin
Hi, all Suppose I have an RDD of (K, V) tuples and I do groupBy on it with the key K. My question is how to make each groupBy result, which is (K, Iterable[V]), an RDD. BTW, can we transform it into a DStream where each groupBy result is an RDD in it? Best Regards, Kevin.

Re: Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Sean Owen
What's your use case? You would not generally want to make so many small RDDs. On Nov 20, 2014 6:19 AM, Dai, Kevin yun...@ebay.com wrote: Hi, all Suppose I have an RDD of (K, V) tuples and I do groupBy on it with the key K. My question is how to make each groupBy result which is (K

Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Simone Franzini
Let's say I have to apply a complex sequence of operations to a certain RDD. In order to make code more modular/readable, I would typically have something like this: object myObject { def main(args: Array[String]) { val rdd1 = function1(myRdd) val rdd2 = function2(rdd1) val rdd3 =

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
how about using fluent style of Scala programming. On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com wrote: Let's say I have to apply a complex sequence of operations to a certain RDD. In order to make code more modular/readable, I would typically have something like

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Sean Owen
This code executes on the driver, and an RDD here is really just a handle on all the distributed data out there. It's a local bookkeeping object. So, manipulation of these objects themselves in the local driver code has virtually no performance impact. These two versions would be about identical*.
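A sketch making the point concrete: both styles below build the same lineage, so they cost the same. function1/2/3 are trivial placeholders standing in for Simone's real transformations, and myRdd is assumed to be an RDD[Int].

    import org.apache.spark.rdd.RDD

    def function1(r: RDD[Int]): RDD[Int] = r.map(_ + 1)
    def function2(r: RDD[Int]): RDD[Int] = r.filter(_ % 2 == 0)
    def function3(r: RDD[Int]): RDD[Int] = r.distinct()

    // Named intermediates: just local handles on the driver, no extra work triggered.
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    val rdd3 = function3(rdd2)

    // Fluent chaining: identical DAG, identical performance.
    val rdd3b = function3(function2(function1(myRdd)))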

Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Blind Faith
Let us say I have the following two RDDs, with the following key-pair values. rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] and rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ] Now, I want to join them by key values, so for example I want to return the following

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Sonal Goyal
Check cogroup. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Nov 13, 2014 at 5:11 PM, Blind Faith person.of.b...@gmail.com wrote: Let us say I have the following two RDDs, with the following key-pair values. rdd1

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Davies Liu
rdd1.union(rdd2).groupByKey() On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith person.of.b...@gmail.com wrote: Let us say I have the following two RDDs, with the following key-pair values. rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] and rdd2 = [ (key1, [value5
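With the example data from the question (written in Scala here, assuming a SparkContext sc), both suggestions look like this sketch; cogroup keeps the two sides apart per key, while union + groupByKey pools them.

    import org.apache.spark.SparkContext._   // pair-RDD operations on Spark 1.x

    val rdd1 = sc.parallelize(Seq("key1" -> Seq("value1", "value2"), "key2" -> Seq("value3", "value4")))
    val rdd2 = sc.parallelize(Seq("key1" -> Seq("value5", "value6"), "key2" -> Seq("value7")))

    // cogroup: per key, an (Iterable from rdd1, Iterable from rdd2) pair.
    val viaCogroup = rdd1.cogroup(rdd2).mapValues { case (a, b) => (a.flatten ++ b.flatten).toList }

    // union + groupByKey: per key, one Iterable holding the value lists from both sides.
    val viaUnion = rdd1.union(rdd2).groupByKey().mapValues(_.flatten.toList)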

RE: Help with processing multiple RDDs

2014-11-11 Thread Kapil Malik
? Regards, Kapil -Original Message- From: akhandeshi [mailto:ami.khande...@gmail.com] Sent: 12 November 2014 03:44 To: u...@spark.incubator.apache.org Subject: Help with processing multiple RDDs I have been struggling to process a set of RDDs. Conceptually, it is is not a large data set

Re: Help with processing multiple RDDs

2014-11-11 Thread buring
i think you can try to set a lower spark.storage.memoryFraction, for example 0.4: conf.set("spark.storage.memoryFraction", "0.4") // default 0.6 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628p18659.html

RE: Help with processing multiple RDDs

2014-11-11 Thread Khandeshi, Ami
RDDs Hi, How is 78g distributed in driver, daemon, executor ? Can you please paste the logs regarding that I don't have enough memory to hold the data in memory Are you collecting any data in driver ? Lastly, did you try doing a re-partition to create smaller and evenly distributed

random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi, Is there a nice or optimal method to randomly shuffle spark streaming RDDs? Thanks, Josh

Re: random shuffle streaming RDDs?

2014-11-03 Thread Sean Owen
I think the answer will be the same in streaming as in the core. You want a random permutation of an RDD? in general RDDs don't have ordering at all -- excepting when you sort for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
When I'm outputting the RDDs to an external source, I would like the RDDs to be outputted in a random shuffle so that even the order is random. So far what I understood is that the RDDs do have a type of order, in that the order for spark streaming RDDs would be the order in which spark streaming

Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful? Batches of RDDs from Streams are going to have temporal ordering in terms of when they are processed in a typical application ... , but maybe you could shuffle the way batch iterations work On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote: When

Re: random shuffle streaming RDDs?

2014-11-03 Thread Sean Owen
a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote: When I'm outputting the RDDs to an external source, I would like the RDDs to be outputted in a random shuffle
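A sketch of that suggestion: a salted sort gives a well-defined but effectively random order, and the salt makes it differ from run to run.

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Pseudo-random permutation: sort by a salted hash of each element.
    def randomShuffle[T](rdd: RDD[T], salt: Long = Random.nextLong()): RDD[T] =
      rdd.sortBy(x => (x.hashCode.toLong, salt).hashCode)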

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
is guaranteed about that. If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote: When I'm outputting the RDDs

Re: Manipulating RDDs within a DStream

2014-10-31 Thread lalit1303
-RDDs-within-a-DStream-tp17740p17800.html

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Helena Edelson
Hi Harold, Can you include the versions of spark and spark-cassandra-connector you are using? Thanks! Helena @helenaedelson On Oct 30, 2014, at 12:58 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'd like to be able to modify values in a DStream, and then send it off to an

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Harold Nguyen
://apache-spark-user-list.1001560.n3.nabble.com/Manipulating-RDDs-within-a-DStream-tp17740p17800.html

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Helena Edelson
Hi Harold, Yes, that is the problem :) Sorry for the confusion, I will make this clear in the docs ;) since master is work for the next version. All you need to do is use spark 1.1.0 as you have it already: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" and assembly - not from
