Re: Manipulating RDDs within a DStream

2014-10-31 Thread Helena Edelson

sharing RDDs between PySpark and Scala

2014-10-30 Thread rok
I'm processing some data using PySpark and I'd like to save the RDDs to disk (they are (k,v) RDDs of strings and SparseVector types) and read them in using Scala to run them through some other analysis. Is this possible? Thanks, Rok -- View this message in context: http://apache-spark-user

Re: Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi, Sorry, there's a typo there: val arr = rdd.toArray Harold On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization

Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me. I'd like to be able to do the following:

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Gerard Maas
a DStream of single-element RDDs, the element being the count, and - dstream.foreachRDD(_.count()) which returns the count directly. In the first case, some random worker node is chosen for the reduce, in another the driver is chosen for the reduce. There should not be a significant performance

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Tathagata Das
(count, reduce): *Here there is another subtle execution difference between - dstream.count() which produces a DStream of single-element RDDs, the element being the count, and - dstream.foreachRDD(_.count()) which returns the count directly. In the first case, some random worker node is chosen
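
A minimal sketch of the two forms being contrasted, assuming a running StreamingContext and some DStream called dstream:

    // Form 1: dstream.count() yields a DStream of single-element RDDs holding the per-batch count
    val counts = dstream.count()
    counts.print()

    // Form 2: run the count action per batch inside foreachRDD; the value comes back to the driver
    dstream.foreachRDD { rdd =>
      val n = rdd.count()
      println(s"records in this batch: $n")
    }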

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
- dstream.count() which produces a DStream of single-element RDDs, the element being the count, and - dstream.foreachRDD(_.count()) which returns the count directly. In the first case, some random worker node is chosen for the reduce, in another the driver is chosen for the reduce. There should

combine rdds?

2014-10-27 Thread Josh J
Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is not above some threshold. Thanks, Josh

Re: combine rdds?

2014-10-27 Thread Koert Kuipers
this requires evaluation of the rdd to do the count. val x: RDD[X] = ... val y: RDD[X] = ... x.cache val z = if (x.count <= thres) x.union(y) else x On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote: Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD
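
A self-contained sketch of the same idea; the threshold and the sample data are made up, and x is cached so the count does not force a second evaluation when the union is computed:

    val x = sc.parallelize(Seq("a", "b"))
    val y = sc.parallelize(Seq("c", "d", "e"))
    x.cache()
    val threshold = 100L  // hypothetical cutoff
    val z = if (x.count() <= threshold) x.union(y) else x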

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-23 Thread Tathagata Das
between - dstream.count() which produces a DStream of single-element RDDs, the element being the count, and - dstream.foreachRDD(_.count()) which returns the count directly. In the first case, some random worker node is chosen for the reduce, in another the driver is chosen for the reduce

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
Thanks Matt, Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. My question still remains: Is it better to foreachRDD early in the process or do as much Dstream transformations before going

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
PS: Just to clarify my statement: Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. With feared RDD operations on the driver I meant to contrast an rdd action like rdd.collect that would

Rdd of Rdds

2014-10-22 Thread Tomer Benyamini
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a foreach on an RDD of RDDs. Here's a java example: public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName(testapp

Re: Rdd of Rdds

2014-10-22 Thread Sean Owen
No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an RDD of RDDs. On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini tomer

Re: Rdd of Rdds

2014-10-22 Thread Sonal Goyal
Another approach could be to create artificial keys for each RDD and convert to PairRDDs. So your first RDD becomes JavaPairRDDInt,String rdd1 with values 1,1 ; 1,2 and so on Second RDD becomes rdd2 is 2, a; 2, b;2,c You can union the two RDDs, groupByKey, countByKey etc and maybe achieve what
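
A small sketch of that approach, assuming two plain RDDs of strings; the keys 1 and 2 are the artificial source tags:

    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on older Spark versions)

    val rdd1 = sc.parallelize(Seq("1", "2")).map(v => (1, v))       // tag every element with key 1
    val rdd2 = sc.parallelize(Seq("a", "b", "c")).map(v => (2, v))  // tag every element with key 2
    val combined = rdd1.union(rdd2)
    val perSourceCounts = combined.countByKey()  // driver-side map: how many elements came from each RDD
    val grouped = combined.groupByKey()          // elements of each original RDD grouped under its key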

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote: No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark! I found out why my RDDs weren't coming through in my spark stream. It turns out that onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example: def onStartMock(): Unit = { val future = new Thread(new
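
A minimal custom receiver sketch along those lines, assuming the Spark 1.x Receiver API; the one-second "tick" data is just a stand-in for a real source:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {
        // onStart() must return quickly, so the actual work runs on its own thread
        new Thread("mock-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("tick " + System.currentTimeMillis())
              Thread.sleep(1000)
            }
          }
        }.start()
      }
      // the worker thread polls isStopped(), so there is nothing extra to clean up here
      def onStop(): Unit = {}
    }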

disk-backing pyspark rdds?

2014-10-21 Thread Eric Jonas
(x) eat_2GB_of_ram() take_2h() return my_100MB_array sc.parallelize(np.arange(100)).map(f).saveAsPickleFile(s3n://blah...) The resulting rdds will most likely not fit in memory but for this use case I don't really care. I know I can persist RDDs, but is there any way to by-default disk

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Oh - and one other note on this, which appears to be the case. If, in your stream foreachRDD implementation, you do something stupid (like call rdd.count()) tweetStream.foreachRDD((rdd, lent) => { tweetStream.repartition(1) numTweetsCollected += 1; //val count = rdd.count()

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Gerard Maas
Pinging TD -- I'm sure you know :-) -kr, Gerard. On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces.

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Matt Narrell
http://spark.apache.org/docs/latest/streaming-programming-guide.html http://spark.apache.org/docs/latest/streaming-programming-guide.html foreachRDD is executed on the driver…. mn On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote: Pinging TD -- I'm sure you know :-)

Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-17 Thread Gerard Maas
Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces. We've been following the pattern: dstream.foreachRDD(rdd => val records = rdd.map(elem => record(elem))
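
A sketch of the two styles being compared; record and writeToStore stand in for the job's own transformation and output function, and both variants build the same per-batch RDD lineage:

    // Option A: transform at the DStream level, then act on each RDD at the end
    dstream.map(elem => record(elem))
           .foreachRDD(rdd => rdd.foreachPartition(writeToStore))

    // Option B: do everything inside foreachRDD
    dstream.foreachRDD { rdd =>
      rdd.map(elem => record(elem)).foreachPartition(writeToStore)
    }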

Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Hi, my programming model requires me to generate multiple RDDs for various datasets across a single run and then run an action on it - E.g. MyFunc myFunc = ... //It implements VoidFunction //set some extra variables - all serializable ... for (JavaRDD<String> rdd: rddList) { ... sc.foreach(myFunc

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Excuse me - the line inside the loop should read: rdd.foreach(myFunc) - not sc.

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread Cheng Lian
to generate multiple RDDs for various datasets across a single run and then run an action on it - E.g. MyFunc myFunc = ... //It implements VoidFunction //set some extra variables - all serializable ... for (JavaRDD<String> rdd: rddList) { ... sc.foreach(myFunc); } The problem I'm seeing is that after

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_

Is Array Of Struct supported in json RDDs? is it possible to query this?

2014-10-13 Thread shahab
Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query Array Of Struct in json RDDs? root |-- createdAt: long (nullable = true) |-- id: string (nullable = true) |-- sessions: array (nullable = true) ||-- element

Re: Is Array Of Struct supported in json RDDs? is it possible to query this?

2014-10-13 Thread Yin Huai
If you are using HiveContext, it should work in 1.1. Thanks, Yin On Mon, Oct 13, 2014 at 5:08 AM, shahab shahab.mok...@gmail.com wrote: Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query Array Of Struct in json RDDs

Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi Spark! I don't quite understand the semantics of RDDs in a streaming context very well yet. Are there any examples of how to implement CustomInputDStreams, with corresponding Receivers, in the docs? I've hacked together a custom stream, which is being opened and is consuming data

One pass compute() to produce multiple RDDs

2014-10-09 Thread Akshat Aranya
Hi, Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while reading in the data only once? One way to do so would be to cache the HadoopRDD and then create derivative RDDs, but that would require enough RAM to cache the HadoopRDD, which is not an option in my case. Thanks

Re: One pass compute() to produce multiple RDDs

2014-10-09 Thread Sean Owen
. But maybe there is expensive work that happens in between reading the raw data and re-using results, so it's still a win. There's no equivalent of MultipleOutputs. On Thu, Oct 9, 2014 at 10:55 PM, Akshat Aranya aara...@gmail.com wrote: Hi, Is there a good way to materialize derivate RDDs from say

experimental solution to nesting RDDs

2014-09-24 Thread ldmtwo
Spark such as, can it handle 100k or even 10 million stages? Can this clever hacky strategy get around the limitation of only managing RDDs from the driver? Can I iterate over permutations (as with nesting) of an RDD set without calling cartesian() and having memory explosion? I've been using Spark

Change RDDs using map()

2014-09-17 Thread Deep Pradhan
Hi, I want to make the following changes in the RDD (create new RDD from the existing to reflect some transformation): In an RDD of key-value pair, I want to get the keys for which the values are 1. How to do this using map()? Thank You

Re: Change RDDs using map()

2014-09-17 Thread Mark Hamstra
You don't. That's what filter or the partial function version of collect are for: val transformedRDD = yourRDD.collect { case (k, v) if k == 1 => v } On Wed, Sep 17, 2014 at 3:24 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I want to make the following changes in the RDD (create new
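
A sketch matching the original question (keys whose value is 1), shown both with filter and with the partial-function overload of collect; yourRDD stands for any RDD of (key, Int) pairs:

    // keys whose value is 1, via filter + map
    val keysWithOne = yourRDD.filter { case (_, v) => v == 1 }.map { case (k, _) => k }

    // the same thing with the partial-function version of collect
    val keysWithOne2 = yourRDD.collect { case (k, v) if v == 1 => k }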

RE: Change RDDs using map()

2014-09-17 Thread qihong

RDDs and Immutability

2014-09-13 Thread Deep Pradhan
Hi, We all know that RDDs are immutable. There are not enough operations that can achieve anything and everything on RDDs. Take for example this: I want an Array of Bytes filled with zeros which during the program should change. Some elements of that Array should change to 1. If I make an RDD

Re: RDDs and Immutability

2014-09-13 Thread Nicholas Chammas
know that RDDs are immutable. There are not enough operations that can achieve anything and everything on RDDs. Take for example this: I want an Array of Bytes filled with zeros which during the program should change. Some elements of that Array should change to 1. If I make an RDD with all

Re: efficient zipping of lots of RDDs

2014-09-11 Thread Mohit Jaggi
filed jira SPARK-3489 https://issues.apache.org/jira/browse/SPARK-3489 On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL

sharing off_heap rdds

2014-09-08 Thread Manku Timma
I see that the tachyon url constructed for an rdd partition has executor id in it. So if the same partition is being processed by a different executor on a reexecution of the same computation, it cannot really use the earlier result. Is this a correct assessment? Will removing the executor id from

Re: Array and RDDs

2014-09-07 Thread Mayur Rustagi
Your question is a bit confusing. I assume you have an RDD containing nodes and some metadata (child nodes maybe) and you are trying to attach another piece of metadata to it (a byte array). If it's just the same byte array for all nodes, you can generate an RDD with the count of nodes and zip the two RDDs together; you can

Array and RDDs

2014-09-05 Thread Deep Pradhan
Hi, I have an input file which consists of stc_node dest_node pairs. I have created an RDD consisting of key-value pairs where the key is the node id and the values are the children of that node. Now I want to associate a byte with each node. For that I have created a byte array. Every time I print out the

RE: RDDs

2014-09-04 Thread Liu, Raymond
: RE: RDDs Thank you Raymond and Tobias. Yeah, I am very clear about what I was asking. I was talking about replicated rdd only. Now that I've got my understanding about job and application validated, I wanted to know if we can replicate an rdd and run two jobs (that need same rdd

Fwd: RDDs

2014-09-04 Thread rapelly kartheek
-- Forwarded message -- From: rapelly kartheek kartheek.m...@gmail.com Date: Thu, Sep 4, 2014 at 11:49 AM Subject: Re: RDDs To: Liu, Raymond raymond@intel.com Thank you Raymond. I am more clear now. So, if an rdd is replicated over multiple nodes (i.e. say two sets of nodes

Re: RDDs

2014-09-04 Thread Tathagata Das
: Thursday, September 04, 2014 1:24 PM To: u...@spark.incubator.apache.org Subject: RE: RDDs Thank you Raymond and Tobias. Yeah, I am very clear about what I was asking. I was talking about replicated rdd only. Now that I've got my understanding about job and application validated, I wanted

Re: RDDs

2014-09-04 Thread Kartheek.R
Thank you yuanbosoft.

efficient zipping of lots of RDDs

2014-09-04 Thread Mohit Jaggi
Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL sugar), as column based operations seem to be most common. But row operations are not uncommon. To get rows out of columns right now I zip the column RDDs
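
A tiny sketch of the zip at the heart of this, assuming column RDDs built with matching lengths and partition counts (zip requires the same number of partitions and the same number of elements per partition):

    val col1 = sc.parallelize(Seq(1, 2, 3), 2)        // first column
    val col2 = sc.parallelize(Seq("a", "b", "c"), 2)  // second column, same length and slicing
    val rows = col1.zip(col2)                         // RDD[(Int, String)] - one tuple per row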

RDDs

2014-09-03 Thread rapelly kartheek
Hi, Can someone tell me what kind of operations can be performed on a replicated rdd? What are the use-cases of a replicated rdd? One basic doubt that has been bothering me for a long time: what is the difference between an application and a job in Spark parlance? I am confused because of Hadoop

Re: RDDs

2014-09-03 Thread Tobias Pfeiffer
-guide.html#resilient-distributed-datasets-rdds as an introduction, it lists a lot of the transformations and output operations you can use. Personally, I also found it quite helpful to read the paper about RDDs: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf One basic doubt

RE: RDDs

2014-09-03 Thread Liu, Raymond
:03 PM To: user@spark.apache.org Subject: RDDs Hi, Can someone tell me what kind of operations can be performed on a replicated rdd?? What are the use-cases of a replicated rdd. One basic doubt that is bothering me from long time: what is the difference between an application and job in the Spark

RE: RDDs

2014-09-03 Thread Kartheek.R
in parallel? -Karthk

Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Verma, Rishi (398J)
Hi Folks, I’d like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform
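
A sketch of one way to do this per batch, assuming Spark 1.1's SQLContext.jsonRDD and registerTempTable, and that the DStream has already been mapped down to the raw JSON strings:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(ssc.sparkContext)

    jsonDStream.foreachRDD { rdd =>              // jsonDStream: a DStream[String] of JSON documents
      if (rdd.take(1).nonEmpty) {                // jsonRDD needs data to infer a schema
        val schemaRdd = sqlContext.jsonRDD(rdd)
        schemaRdd.registerTempTable("events")    // "events" is an arbitrary table name
        sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
      }
    }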

Re: Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Tathagata Das
like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform queries

Replicate RDDs

2014-08-27 Thread rapelly kartheek
Hi I have a three node spark cluster. I restricted the resources per application by setting appropriate parameters and I could run two applications simultaneously. Now, I want to replicate an RDD and run two applications simultaneously. Can someone help how to go about doing this!!! I replicated

Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang

Re: Losing Executors on cluster with RDDs of 100GB

2014-08-26 Thread MEETHU MATHEW
mode. Previously my application was working well ( several RDDs the largest being around 50G). When I started processing larger amounts of data (RDDs of 100G) my app is losing executors. Im currently just loading them from a database, rePartitioning and persisting to disk (with replication x2) I

Re: Printing the RDDs in SparkPageRank

2014-08-26 Thread Deep Pradhan
println(parts(0)) does not solve the problem. It does not work On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote: On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: When I add parts(0).collect().foreach(println)

Re: Out of memory on large RDDs

2014-08-26 Thread Andrew Ash
to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/Re-Out-of-memory-on-large-RDDs-tp2533p2534.html

Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Deep Pradhan
the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the *parts* in the code (denoted by the comment in Bold letters). But, when I try to do

Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Sean Owen
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: When I add parts(0).collect().foreach(println) parts(1).collect().foreach(println), for printing parts, I get the following error not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit

Printing the RDDs in SparkPageRank

2014-08-24 Thread Deep Pradhan
Hi, I was going through the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the *parts* in the code (denoted by the comment in Bold

Re: Printing the RDDs in SparkPageRank

2014-08-24 Thread Jörn Franke
Hi, What kind of error do you receive? Best regards, Jörn On 24 Aug 2014 08:29, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I was going through the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part

Losing Executors on cluster with RDDs of 100GB

2014-08-22 Thread Yadid Ayzenberg
Hi all, I have a spark cluster of 30 machines, 16GB / 8 cores on each running in standalone mode. Previously my application was working well ( several RDDs the largest being around 50G). When I started processing larger amounts of data (RDDs of 100G) my app is losing executors. Im currently

Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hi All. I need to create a lot of RDDs starting from a set of roots and count the rows in each. Something like this: final JavaSparkContext sc = new JavaSparkContext(conf); List<String> roots = ... Map<String, Object> res = sc.parallelize(roots).mapToPair(new PairFunction<String, String, Long

Re: Working with many RDDs in parallel?

2014-08-18 Thread Sean Owen
You won't be able to use RDDs inside of an RDD operation. I imagine your immediate problem is that the code you've elided references 'sc' and that gets referenced by the PairFunction and serialized, but it can't be. If you want to play it this way, parallelize across roots in Java. That is just use
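
One non-nested alternative, sketched on the driver side in Scala; the roots list and pathFor are placeholders for however a root maps to its input:

    val roots = Seq("rootA", "rootB", "rootC")     // stand-in list of roots
    val counts: Map[String, Long] = roots.map { root =>
      root -> sc.textFile(pathFor(root)).count()   // one independent RDD and one count job per root
    }.toMap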

Re: Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
PM, Sean Owen so...@cloudera.com wrote: You won't be able to use RDDs inside of RDD operation. I imagine your immediate problem is that the code you've elided references 'sc' and that gets referenced by the PairFunction and serialized, but it can't be. If you want to play it this way

Re:[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
number, GraphX jobs will throw: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions So my quick fix is to repartition the EdgeRDD to exactly the number of parallelism. But I think this would lead to much network communication. So is there any other better

Can't zip RDDs with unequal numbers of partitions

2014-08-05 Thread Bin
Hi All, I met the titled error. This exception occurred in line 223, as shown below: 212 // read files 213 val lines = sc.textFile(path_edges).map(line => line.split(",")).map(line => ((line(0), line(1)), line(2).toDouble)).reduceByKey(_ + _).cache 214 215 val

Is deferred execution of multiple RDDs ever coming?

2014-07-21 Thread Harry Brundage
across RDDs and have spark execute them efficiently. Our use case for a feature like this is processing many records and attaching metadata to the records during processing about our confidence in the data-points, and then writing the data to one spot and the metadata to another spot. I've also wanted

Re: Stateful RDDs?

2014-07-14 Thread Tathagata Das
RDDs are touched in every batch. Since spark streaming is not really a dedicated data store, its not really designed to separate out hot data and cold data. 2. For each key, in the state you could maintain a timestamp of when it was updated and accordingly return None to filter that state out

Stateful RDDs?

2014-07-10 Thread Sargun Dhillon
. Additionally, given that my data is partitionable by datetime, does it make sense to have a custom datetime partitioner, and just persist the dstream to disk, to ensure that its RDDs are only pulled off of disk (into memory) occasionally? What's the cost of having a bunch of relatively large, stateful RDDs

SparkSQL with sequence file RDDs

2014-07-07 Thread Gary Malouf
Has anyone reported issues using SparkSQL with sequence files (all of our data is in this format within HDFS)? We are considering whether to burn the time upgrading to Spark 1.0 from 0.9 now and this is a main decision point for us.

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where SparkSQL can detect the schema (i.e. scala case classes or java beans) before you can register the RDD as a

RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
: Mon, 7 Jul 2014 17:12:42 -0700 Subject: Re: SparkSQL with sequence file RDDs To: user@spark.apache.org I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
We know Scala 2.11 has removed the limitation on parameter number, but Spark 1.0 is not compatible with it. So now we are considering using java beans instead of Scala case classes. You can also manually create a class that implements scala's Product interface. Finally, SPARK-2179

RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
...@databricks.com Date: Mon, 7 Jul 2014 17:52:34 -0700 Subject: Re: SparkSQL with sequence file RDDs To: user@spark.apache.org We know Scala 2.11 has remove the limitation of parameter number, but Spark 1.0 is not compatible with it. So now we are considering use java beans instead of Scala case

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
-- From: mich...@databricks.com Date: Mon, 7 Jul 2014 17:52:34 -0700 Subject: Re: SparkSQL with sequence file RDDs To: user@spark.apache.org We know Scala 2.11 has remove the limitation of parameter number, but Spark 1.0 is not compatible with it. So now we

Re: Callbacks on freeing up of RDDs

2014-07-02 Thread Mayur Rustagi
A lot of the RDDs that you create in code may not even be constructed, as the task layer is optimized in the DAG scheduler. The closest is onUnpersistRDD in SparkListener. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon,

Callbacks on freeing up of RDDs

2014-06-30 Thread Jaideep Dhok
Hi all, I am trying to create a custom RDD class for result set of queries supported in InMobi Grill (http://inmobi.github.io/grill/) Each result set has a schema (similar to Hive's TableSchema) and a path in HDFS containing the result set data. An easy way of doing this would be to create a

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
Yep exactly! I’m not sure how complicated it would be to pull off. If someone wouldn’t mind helping to get me pointed in the right direction I would be happy to look into and contribute this functionality. I imagine this would be implemented in the scheduler codebase and there would be some

Re: balancing RDDs

2014-06-24 Thread Mayur Rustagi
This would be really useful, especially for Shark, where a shift of partitioning affects all subsequent queries unless task scheduling time beats spark.locality.wait. It can cause overall low performance for all subsequent tasks. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

balancing RDDs

2014-06-23 Thread Sean McNamara
We have a use case where we’d like something to execute once on each node and I thought it would be good to ask here. Currently we achieve this by setting the parallelism to the number of nodes and use a mod partitioner: val balancedRdd = sc.parallelize( (0 until Settings.parallelism)
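
A sketch of that pattern - one key per desired partition so each partition holds exactly one element; Settings.parallelism is the node count from the setup above and initNodeLocalResource is a hypothetical once-per-node initializer. Whether each partition actually lands on a distinct node is exactly the scheduling concern raised in this thread:

    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on older Spark versions)
    import org.apache.spark.HashPartitioner

    val n = Settings.parallelism
    val balancedRdd = sc.parallelize(0 until n)
      .map(i => (i, i))
      .partitionBy(new HashPartitioner(n))  // key i mod n lands in partition i: one element each

    balancedRdd.foreachPartition { _ =>
      initNodeLocalResource()               // runs once per partition
    }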

Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread Anita Tailor
-contractor.com wrote: Thanks Krishna. Seems like you have to use Avro and then convert that to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into this some more. Thanks, Mahesh From: Krishna Sankar ksanka...@gmail.com Reply-To: user@spark.apache.org user

Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread contractor
RDDs to Parquet records I have a similar case where I have an RDD [List[Any], List[Long] ] and want to save it as a Parquet file. My understanding is that only RDDs of case classes can be converted to SchemaRDD. So is there any way I can save this RDD as a Parquet file without using Avro? Thanks

Re: Spark streaming RDDs to Parquet records

2014-06-19 Thread Anita Tailor
Tailor [via Apache Spark User List] [hidden email] http://user/SendEmail.jtp?type=nodenode=7971i=0 Date: Thursday, June 19, 2014 at 12:53 PM To: Mahesh Padmanabhan [hidden email] http://user/SendEmail.jtp?type=nodenode=7971i=1 Subject: Re: Spark streaming RDDs to Parquet records I have

Spark streaming RDDs to Parquet records

2014-06-17 Thread maheshtwc
Hello, Is there an easy way to convert RDDs within a DStream into Parquet records? Here is some incomplete pseudo code: // Create streaming context val ssc = new StreamingContext(...) // Obtain a DStream of events val ds = KafkaUtils.createStream(...) // Get Spark context to get to the SQL
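
A sketch of the SparkSQL route per batch, assuming Spark 1.0+'s createSchemaRDD implicit and saveAsParquetFile; Event and parseEvent are made-up stand-ins for the real record type and parser:

    case class Event(id: String, ts: Long)

    val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD              // RDD[Event] => SchemaRDD

    ds.map { case (_, raw) => parseEvent(raw) }    // ds: DStream[(String, String)] from KafkaUtils
      .foreachRDD { (rdd, time) =>
        rdd.saveAsParquetFile(s"hdfs:///events/batch-${time.milliseconds}")
      }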

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
, maheshtwc mahesh.padmanab...@twc-contractor.com wrote: Hello, Is there an easy way to convert RDDs within a DStream into Parquet records? Here is some incomplete pseudo code: // Create streaming context val ssc = new StreamingContext(...) // Obtain a DStream of events val ds

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread contractor
Thanks Krishna. Seems like you have to use Avro and then convert that to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into this some more. Thanks, Mahesh From: Krishna Sankar ksanka...@gmail.commailto:ksanka...@gmail.com Reply-To: user@spark.apache.orgmailto:user

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Michael Armbrust
to use Avro and then convert that to Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look into this some more. Thanks, Mahesh From: Krishna Sankar ksanka...@gmail.com Reply-To: user@spark.apache.org user@spark.apache.org Date: Tuesday, June 17, 2014 at 2:41 PM

list of persisted rdds

2014-06-13 Thread mrm
Hi, How do I check the rdds that I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all rdd's at once? And is it possible to get the names of the rdd's that are currently persisted (list = rd1, rd2, ..., rdN)? Thank you

Re: list of persisted rdds

2014-06-13 Thread Daniel Darabos
Check out SparkContext.getPersistentRDDs! On Fri, Jun 13, 2014 at 1:06 PM, mrm ma...@skimlinks.com wrote: Hi, How do I check the rdds that I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all rdd's at once
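
A Scala-side sketch using that API (later messages in this thread note the original poster is on PySpark, where getPersistentRDDs is not directly exposed):

    // list what is currently persisted
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"cached RDD id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
    }

    // unpersist everything at once
    sc.getPersistentRDDs.values.foreach(_.unpersist())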

Re: list of persisted rdds

2014-06-13 Thread mrm
appreciate it if you could help me with this, I have tried different ways and googling it! I suspect it might be a silly error but I can't figure it out. Maria

Re: list of persisted rdds

2014-06-13 Thread Mayur Rustagi

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
different ways and googling it! I suspect it might be a silly error but I can't figure it out. Maria

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas

Merging all Spark Streaming RDDs to one RDD

2014-06-09 Thread Henggang Cui
Hi, I'm wondering whether it's possible to continuously merge the RDDs coming from a stream into a single RDD efficiently. One thought is to use the union() method. But using union, I will get a new RDD each time I do a merge. I don't know how I should name these RDDs, because I remember Spark

Re: How to create RDDs from another RDD?

2014-06-03 Thread Gerard Maas
to the RDD data, so potentially the easiest way to solve the problem at hand is to create several RDDs from the original RDD. The issue I see is that the 'sc.makeRDD(v.toSeq)' will potentially blow up when trying to materialize the iterator into a seq. I also don't know what the behaviour of that call

Re: How to create RDDs from another RDD?

2014-06-03 Thread Andrew Ash
the problem at hand is to create several RDDs from the original RDD. The issue I see is that the 'sc.makeRDD(v.toSeq)' will potentially blow when trying to materialize the iterator into a seq. I also don't know what the behaviour of that call to SparkContext will be on a remote worker. My

How to create RDDs from another RDD?

2014-06-02 Thread Gerard Maas
The RDD API has functions to join multiple RDDs, such as PairRDD.join or PairRDD.cogroup, that take another RDD as input, e.g. firstRDD.join(secondRDD). I'm looking for ways to do the opposite: split an existing RDD. What is the right way to create derivative RDDs from an existing RDD? e.g
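
A common sketch for this kind of split - derive each subset with filter and cache the parent so it is only computed once; the value type and threshold are made up:

    val parent = firstRDD.cache()                 // say firstRDD: RDD[(String, Double)]
    val threshold = 0.5
    val highs = parent.filter { case (_, v) => v >= threshold }
    val lows  = parent.filter { case (_, v) => v <  threshold }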
