DataFrames in Spark - Performance when interjected with RDDs

2015-09-07 Thread Pallavi Rao
Hello All, I had a question regarding the performance optimization (Catalyst Optimizer) of DataFrames. I understand that DataFrames are interoperable with RDDs. If I switch back and forth between DataFrames and RDDs, does the performance optimization still kick in? I need to switch to RDDs to
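
A minimal sketch of the interop in question (assuming Spark 1.4+ and an existing SQLContext `sqlContext`; the path and fields are illustrative). Catalyst plans and optimizes only the DataFrame stages; the map over df.rdd is opaque to the optimizer, so optimizations stop at that boundary and resume once the data is back in a DataFrame:

    // DataFrame stage: planned and optimized by Catalyst
    val df = sqlContext.read.json("hdfs:///data/people.json")

    // RDD stage: ordinary Scala code, invisible to the optimizer
    val pairs = df.rdd.map(row => (row.getAs[String]("name"), 1))

    // Back to a DataFrame: subsequent operations are optimized again
    val counts = sqlContext.createDataFrame(pairs).toDF("name", "one").groupBy("name").count()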

Re: Parallel execution of RDDs

2015-08-31 Thread Brian Parker
Thank you for the comments. As you mentioned, increasing the thread pool size allowed more parallel jobs, and decreasing the number of partitions allowed more RDDs to execute in parallel. Much appreciated. On Aug 31, 2015 7:07 AM, "Igor Berman" wrote: > what is size of the pool you submitti
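
For reference, a hedged sketch of the pattern discussed in this thread: independent jobs submitted from an explicit fixed thread pool via Futures, so several can be active at once (assuming an existing SparkContext `sc`; the pool size and paths are illustrative, not recommendations):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    // The pool size bounds how many jobs can be submitted concurrently
    implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

    val rdds = (1 to 20).map(i => sc.textFile(s"hdfs:///input/part-$i"))
    val jobs = rdds.zipWithIndex.map { case (rdd, i) =>
      Future { rdd.saveAsTextFile(s"hdfs:///output/part-$i") }  // each save is a separate Spark job
    }
    Await.result(Future.sequence(jobs), Duration.Inf)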

Re: Parallel execution of RDDs

2015-08-31 Thread Igor Berman
What is the size of the pool you are submitting Spark jobs from (the futures you've mentioned)? Is it 8? I think you have a fixed thread pool of 8, so there can't be more than 8 parallel jobs running... so try to increase it. What is the number of partitions of each of your RDDs? How many cores does your work

Parallel execution of RDDs

2015-08-31 Thread Brian Parker
Hi, I have a large number of RDDs that I need to process separately. Instead of submitting these jobs to the Spark scheduler one by one, I'd like to submit them in parallel in order to maximize cluster utilization. I've tried to process the RDDs as Futures, but the number of Active jobs

Re: How to list all dataframes and RDDs available in current session?

2015-08-24 Thread Dhaval Gmail
text... >> On Aug 21, 2015 12:06 AM, "Rishitesh Mishra" >> wrote: >> I am not sure if you can view all RDDs in a session. Tables are maintained >> in a catalogue . Hence its easier. However you can see the DAG >> representation , which lists all the RD

Re: How to list all dataframes and RDDs available in current session?

2015-08-21 Thread Raghavendra Pandey
You can get the list of all the persisted RDDs using the Spark context... On Aug 21, 2015 12:06 AM, "Rishitesh Mishra" wrote: > I am not sure if you can view all RDDs in a session. Tables are maintained > in a catalogue . Hence its easier. However you can see the DAG > representatio
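
A small sketch of what "using the Spark context" refers to (assuming an existing SparkContext `sc`; only RDDs that were explicitly cached/persisted show up here):

    // Map from RDD id to the persisted RDD handle
    val persisted = sc.getPersistentRDDs
    persisted.foreach { case (id, rdd) =>
      println(s"RDD $id name=${rdd.name} storage=${rdd.getStorageLevel.description}")
    }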

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Rishitesh Mishra
I am not sure if you can view all RDDs in a session. Tables are maintained in a catalogue, hence it's easier. However, you can see the DAG representation, which lists all the RDDs in a job, with the Spark UI. On 20 Aug 2015 22:34, "Dhaval Patel" wrote: > Apologies > > I ac

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Apologies, I accidentally included the Spark User DL on BCC. The actual email message is below. = Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that

How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that have been created in the current session. Does anyone know if any such command is available? Something similar to the SparkSQL command to list all temp tables: show

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
Matthew O'Reilly wrote: > Hi, > > I am currently working on the latest version of Apache Spark (1.4.1), > pre-built package for Hadoop 2.6+. > > Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache > (something similar is Altibase's HDB: > http

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe it could be added in a future release. Thanks Best Regards On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly wrote: > Hi, > > I am currently working on the latest version o

Encryption on RDDs or in-memory/cache on Apache Spark

2015-07-31 Thread Matthew O'Reilly
Hi, I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache (something similar is Altibase's HDB:  http://altibase.com/in-memory-database-computing-solutions/sec

Re: RDDs join problem: incorrect result

2015-07-28 Thread ๏̯͡๏
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p24049.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user

Re: RDDs join problem: incorrect result

2015-07-28 Thread ponkin
Hi Alice, did you find a solution? I have exactly the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p24049.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-27 Thread Xiangrui Meng
Hi Stahlman, finalRDDStorageLevel is the storage level for the final user/item factors. It is not common to set it to StorageLevel.NONE, unless you want to save the factors directly to disk. So if it is NONE, we cannot unpersist the intermediate RDDs (in/out blocks) because the final user/item
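
For context, a hedged sketch of the knobs being discussed (Spark 1.x MLlib ALS, assuming an existing SparkContext `sc`; the ratings path and parameter values are illustrative). Keeping finalRDDStorageLevel at a real storage level lets ALS unpersist the intermediate in/out block RDDs once the user/item factors are materialized:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.storage.StorageLevel

    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
      .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)  // not NONE, so intermediates can be freed
    val model = als.run(ratings)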

Re: Encryption on RDDs or in-memory on Apache Spark

2015-07-27 Thread Akhil Das
Fri, Jul 24, 2015 at 2:12 PM, IASIB1 wrote: > I am currently working on the latest version of Apache Spark (1.4.1), > pre-built package for Hadoop 2.6+. > > Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory > (similarly > to Altibase's HDB: > http://al

Encryption on RDDs or in-memory on Apache Spark

2015-07-24 Thread IASIB1
I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory (similarly to Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/ <http://altibas

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread Haoyuan Li
Yes. Tachyon can handle this well: http://tachyon-project.org/ Best, Haoyuan On Wed, Jul 22, 2015 at 10:56 AM, swetha wrote: > Hi, > > We have a requirement wherein we need to keep RDDs in memory between Spark > batch processing that happens every one hour. The idea here is

Re: How to share a Map among RDDS?

2015-07-22 Thread Dan Dong
abe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala >>> . >>> >>> -Andrew >>> >>> 2015-07-21 19:56 GMT-07:00 ayan guha : >>> >>>> Either you have to do rdd.collect and then broadcast or you can do a >>>&

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread harirajaram
I was about to say whatever the previous post said, so +1 to the previous post. From my understanding (gut feeling) of your requirement, it is very easy to do this with spark-job-server. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs-in-memory

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
e to do rdd.collect and then broadcast or you can do a join >>> On 22 Jul 2015 07:54, "Dan Dong" wrote: >>> >>>> Hi, All, >>>> >>>> >>>> I am trying to access a Map from RDDs that are on different compute >>>

Re: How to share a Map among RDDS?

2015-07-22 Thread Dan Dong
do rdd.collect and then broadcast or you can do a join >> On 22 Jul 2015 07:54, "Dan Dong" wrote: >> >>> Hi, All, >>> >>> >>> I am trying to access a Map from RDDs that are on different compute >>> nodes, but without success. The

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but it's not exactly the same as keeping it cached in Spark. Spark Job Server is a way to keep it cached in Spark. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RD

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Tachyon is one way. Also check out the Spark Job Server <https://github.com/spark-jobserver/spark-jobserver> . -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs-in-memory-between-two-different-batch-jobs-tp23957p23958.html Sent fr

RE: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Ganelin, Ilya
talone.com<mailto:jonathan.stahl...@capitalone.com>] Sent: Wednesday, July 22, 2015 01:42 PM Eastern Standard Time To: user@spark.apache.org Subject: Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel Hello again, In trying to understand the caching of intermediate RDDs by ALS, I looked into

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Stahlman, Jonathan
Hi Burak, Looking at the source code, the intermediate RDDs used in ALS.train() are persisted during the computation using intermediateRDDStorageLevel (default value is StorageLevel.MEMORY_AND_DISK) - see here<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark

How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread swetha
Hi, We have a requirement wherein we need to keep RDDs in memory between Spark batch jobs that run every hour. The idea here is to have RDDs holding active user sessions in memory between two jobs, so that once one job's processing is done and another job runs an hour later, the RDDs

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Burak Yavuz
l 22, 2015 at 10:38 AM, Stahlman, Jonathan < jonathan.stahl...@capitalone.com> wrote: > Hello again, > > In trying to understand the caching of intermediate RDDs by ALS, I looked > into the source code and found what may be a bug. Looking here: > > > https://github.com/ap

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Stahlman, Jonathan
Hello again, In trying to understand the caching of intermediate RDDs by ALS, I looked into the source code and found what may be a bug. Looking here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L230 you see that ALS.train

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
's the most memory intensive. -Andrew 2015-07-21 13:47 GMT-07:00 wdbaruni : > I am new to Spark and I understand that Spark divides the executor memory > into the following fractions: > > *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or > .ca

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
rg/apache/spark/examples/BroadcastTest.scala . -Andrew 2015-07-21 19:56 GMT-07:00 ayan guha : > Either you have to do rdd.collect and then broadcast or you can do a join > On 22 Jul 2015 07:54, "Dan Dong" wrote: > >> Hi, All, >> >> >> I am trying to acc

Re: How to share a Map among RDDS?

2015-07-21 Thread ayan guha
Either you have to do rdd.collect and then broadcast or you can do a join On 22 Jul 2015 07:54, "Dan Dong" wrote: > Hi, All, > > > I am trying to access a Map from RDDs that are on different compute nodes, > but without success. The Map is like: > > val map

How to share a Map among RDDS?

2015-07-21 Thread Dan Dong
Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa"->1,"bb"->2,"cc"->3,...) All RDDs will have to check against it to see if the key is in the Map or not, so see
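
A minimal sketch of the broadcast approach suggested in the replies above (assuming an existing SparkContext `sc`; the map and data are illustrative):

    val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
    val bcMap = sc.broadcast(map1)                    // shipped once to each executor

    val words = sc.parallelize(Seq("aa", "cc", "zz"))
    val matched = words.filter(w => bcMap.value.contains(w))   // tasks read the broadcast copy
    matched.collect().foreach(println)                // prints aa and cc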

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-21 Thread wdbaruni
I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or .cache() and can be defined by setting spark.storage.memoryFraction (default 0.6) *Shuffle and aggregation buffers
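
A hedged sketch of the settings referred to here (Spark 1.x static memory management; the values shown are just the defaults, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-fractions-example")
      .setMaster("local[*]")                          // local master only for this sketch
      .set("spark.storage.memoryFraction", "0.6")     // persisted RDDs (.persist()/.cache())
      .set("spark.shuffle.memoryFraction", "0.2")     // shuffle and aggregation buffers
    // The rest of the executor heap is working memory, where non-persisted RDDs are computed.
    val sc = new SparkContext(conf)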

How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-16 Thread Stahlman, Jonathan
. A sample code in python is copied below. The issue I have is that each new model which is trained caches a set of RDDs and eventually the executors run out of memory. Is there any way in Pyspark to unpersist() these RDDs after each iteration? The names of the RDDs which I gather from the UI

Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Vetle Leinonen-Roeim
On Thu, Jul 16, 2015 at 7:37 AM Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried using a rdd of rdd but that does no

Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Davies Liu
sc.union(rdds).saveAsTextFile() On Wed, Jul 15, 2015 at 10:37 PM, Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried usi
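
A tiny sketch of the suggestion above (assuming an existing SparkContext `sc`; the data and output path are illustrative) — the union turns the separate saves into a single job writing one output directory:

    val rdds = Seq(sc.parallelize(1 to 10), sc.parallelize(11 to 20), sc.parallelize(21 to 30))
    sc.union(rdds).saveAsTextFile("/tmp/cache/all")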

Running foreach on a list of rdds in parallel

2015-07-15 Thread Brandon White
Hello, I have a list of RDDs: List(rdd1, rdd2, rdd3, rdd4). I would like to save these RDDs in parallel. Right now, it is running each operation sequentially. I tried using an RDD of RDDs but that does not work. list.foreach { rdd => rdd.saveAsTextFile("/tmp/cache/") } Any ideas?

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Ted Yu
in my Spark Streaming program >>> (Java): >>> >>> dStream.foreachRDD((rdd, batchTime) -> { >>> log.info("processing RDD from batch {}", batchTime); >>> >>> // my rdd processing code >>>

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread N B
>> >> Instead of having my rdd processing code called once for each RDD in the >> batch, is it possible to essentially group all of the RDDs from the batch >> into a single RDD and single partition and therefore operate on all of the >> elements in the batch at once?

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
called once for each RDD in the > batch, is it possible to essentially group all of the RDDs from the batch > into a single RDD and single partition and therefore operate on all of the > elements in the batch at once? > > My goal here is to do an operation exactly once for every bat

Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
stead of having my rdd processing code called once for each RDD in the batch, is it possible to essentially group all of the RDDs from the batch into a single RDD and single partition and therefore operate on all of the elements in the batch at once? My goal here is to do an operation exactly on
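
One hedged way to get at "all elements of the batch at once", sketched below (assuming a DStream `dStream` built from some StreamingContext; coalescing each batch RDD into a single partition is only sensible for modest batch sizes):

    dStream.foreachRDD { (rdd, batchTime) =>
      rdd.coalesce(1).foreachPartition { iter =>
        // a single task sees every element of this batch
        println(s"processing batch $batchTime")
        iter.foreach(println)
      }
    }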

Re: correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread DW @ Gmail
ailable"? > > Also, what are the correct imports to get this working? > > I'm using sbt assembly to try to compile these files, and would really > appreciate any help. > > Thanks, > Ashley Wang > > > > -- > View this message in context: > ht

correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread ashwang168
y Wang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/correct-Scala-Imports-for-creating-DFs-from-RDDs-tp23829.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Khaled Hammouda
>> From: Tathagata Das >> Date: 20 June 2015 at 17:21 >> Subject: Re: Serial batching with Spark Streaming >> To: Michal Čizmazia >> Cc: Binh Nguyen Van , user >> >> >> No it does not. By default, only after all the retries etc related to >> ba

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
tarted. > > Yes, one RDD per batch per DStream. However, the RDD could be a union of > multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned > DStream). > > TD > > On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia > wrote: > Thanks Tathagata! > > I wil

Re: Are Spark Streaming RDDs always processed in order?

2015-07-04 Thread Michal Čizmazia
: Re: Serial batching with Spark Streaming To: Michal Čizmazia Cc: Binh Nguyen Van , user No it does not. By default, only after all the retries etc related to batch X is done, then batch X+1 will be started. Yes, one RDD per batch per DStream. However, the RDD could be a union of multiple RDDs

Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
ck of > messages, i.e. no need to ack one-by-one, but only ack the last event in a > batch and that would ack the entire batch. > > Before I commit to doing so, I'd like to know if Spark Streaming always > processes RDDs in the same order they arrive in, i.e. if RDD1 arrives &

Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
ng so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished? This is crucial to the ack logic, since if RDD2 can be potentially processed whi

Re: Union of many RDDs taking a long time

2015-06-29 Thread Tomasz Fruboes
SparkContext once, as soon as you have all RDDs ready. For python it looks this way: rdds = [] for i in xrange(cnt): rdd = ... rdds.append(rdd) finalRDD = sparkContext.union(rdds) HTH, Tomasz W dniu 18.06.2015 o 02:53, Matt Forbes pisze: I have multiple input paths which

Union of many RDDs taking a long time

2015-06-17 Thread Matt Forbes
: rdd.union(nextRdd); rdd = rdd.coalesce(nextRdd.partitions().size()); } Now, for a small number of inputs there doesn't seem to be a problem, but for the full set which is about 60 sub-RDDs coming in at around 500MM total records takes a very long time to construct. Just for a simple load

Re: RDD of RDDs

2015-06-10 Thread ping yan
Thanks much for the detailed explanations. I suspected there was no architectural support for the notion of an RDD of RDDs, but my understanding of Spark, or distributed computing in general, is not deep enough to let me understand it better, so this really helps! I ended up going with List[RDD]. The collection of

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
>> >> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar wrote: >> >>> Simillar question was asked before: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html >>> >>> Here is one of the reasons why I think RDD[RDD[T]] is not pos

Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
n or action APIs of > RDD), it will be possible to have RDD of RDD. > > On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar wrote: > >> Simillar question was asked before: >> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html >> >> Here

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
; http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html > > Here is one of the reasons why I think RDD[RDD[T]] is not possible: > >- RDD is only a handle to the actual data partitions. It has a >reference/pointer to the *SparkContext* object (*sc*) and a li

Re: Rdd of Rdds

2015-06-09 Thread lonikar
rk job. Hope it helps. You need to consider List[RDD] or some other collection. Possibly in future, if and when spark architecture allows workers to launch spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have RDD of RDD. -- View this messa

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Simillar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD[RDD[T]] is not possible: - RDD is only a handle to the actual data partitions. It has a reference/pointer to the *SparkContext* object

Re: How does lineage get passed down in RDDs

2015-06-08 Thread maxdml
ge in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-lineage-get-passed-down-in-RDDs-tp23196p23212.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e

RDD of RDDs

2015-06-08 Thread ping yan
Hi, The problem I am looking at is as follows: - I read in a log file of multiple users as an RDD - I'd like to group the above RDD into *multiple RDDs* by userIds (the key) - my processEachUser() function then takes in each RDD mapped to each individual user, and calls for RDD.m
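
Since (per the replies above) an RDD[RDD[_]] isn't possible, a hedged sketch of the List[RDD] alternative mentioned later in the thread — one filtered RDD per user id (assuming an existing SparkContext `sc`; the log format is illustrative, and this only scales to a modest number of distinct users):

    val logs = sc.textFile("hdfs:///logs/users.log").map { line =>
      val fields = line.split('\t')
      (fields(0), line)                               // (userId, full record)
    }
    logs.cache()                                      // reused once per user below

    val userIds = logs.keys.distinct().collect()
    val perUser: List[(String, org.apache.spark.rdd.RDD[String])] =
      userIds.map(id => id -> logs.filter(_._1 == id).values).toList
    // each element is (userId, RDD of that user's records) -> pass to processEachUser()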

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
around RDD. As another interest, I wanted check if some of the DF execution functions can be executed on GPUs. For that to happen, the columnar layout is important. Here is where DF scores over ordinary RDDs. Seems like the batch size defined by spark.sql.inMemoryColumnarStorage.batchSize is set to

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
You may refer to DataFrame Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame Methods listed in "Language Integrated Queries" and "RDD Options" can be viewed as "transformations", and those listed in "Actions" are, of course, actions. As for SQLCo

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread ayan guha
I would think DF=RDD+Schema+some additional methods. In fact, a DF object has a DF.rdd in it so you can (if needed) convert DF<=>RDD really easily. On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar wrote: > Thanks. Can you point me to a place in the documentation of SQL > programming guide or DataFr

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
Thanks. Can you point me to a place in the documentation of SQL programming guide or DataFrame scaladoc where this transformation and actions are grouped like in the case of RDD? Also if you can tell me if sqlContext.load and unionAll are transformations or actions... I answered a question on the

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
For DataFrame, there are also transformations and actions. And transformations are also lazily evaluated. However, DataFrame transformations like filter(), select(), agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions. Cheng On 6/8/15 1:33 PM, kira
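
A short sketch of that distinction (assuming Spark 1.x and an existing SQLContext `sqlContext`; the file and columns are illustrative):

    val df = sqlContext.read.json("hdfs:///data/people.json")
    val adults = df.filter(df("age") >= 21).select("name")   // transformations: lazy, return DataFrames
    adults.show()                                            // action: triggers execution
    val rows = adults.collect()                              // action: materializes Rows on the driver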

Re: Column operation on Spark RDDs.

2015-06-08 Thread lonikar
;)) val dt = dataRDD.*zipWithUniqueId*.map(_.swap) val newCol1 = *dt*.map {case (i, x) => (i, x(1)+x(18)) } val newCol2 = newCol1.join(dt).map(x=> function(.)) Hope this helps. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp
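
A hedged reconstruction of the approach sketched above (assuming an existing SparkContext `sc`; the file layout, column indices, and the combining function are illustrative): index the rows once, derive new "columns" keyed by that index, then join them back.

    val dataRDD = sc.textFile("hdfs:///data/matrix.csv").map(_.split(',').map(_.toDouble))
    val dt = dataRDD.zipWithUniqueId.map(_.swap)                    // (rowId, row)
    val newCol1 = dt.map { case (i, x) => (i, x(1) + x(18)) }       // derived column keyed by rowId
    val newCol2 = newCol1.join(dt).map { case (i, (c1, row)) => (i, c1 * row(0)) }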

Re: Column operation on Spark RDDs.

2015-06-08 Thread kiran lonikar
,")) > val dt = dataRDD.zipWithIndex.map(_.swap) > val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap) > val newCol2 = newCol1.join(dt).map(x=> function(.)) > > Is there a better way of doing this? > > Thank you very much! > > > > >

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I did not send it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark colu

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: > For the following code: > > val df = sqlContext.parquetFile(path) > > `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:

Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian
ze rows? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when-creating-RDDs-from-Parquet-or-ORC-files-tp23139.html Sent from the Apache Spark User List mailing li

Column operation on Spark RDDs.

2015-06-04 Thread Carter
abble.com/Column-operation-on-Spark-RDDs-tp23165.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-03 Thread lonikar
n df.cache().map{row => ...}? Is it a logical row which maintains an array of columns and each column in turn is an array of values for batchSize rows? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or if it's the row-wise structure that a spark RDD has. The section Spark SQL and DataFrames

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread ๏̯͡๏
[] > > > On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen wrote: > >> In the sense here, Spark actually does have operations that make multiple >> RDDs like randomSplit. However there is not an equivalent of the partition >> operation which gives the elements that matched and d

Re: Filter operation to return two RDDs at once.

2015-06-03 Thread Jeff Zhang
ordCount.scala:20 [] On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen wrote: > In the sense here, Spark actually does have operations that make multiple > RDDs like randomSplit. However there is not an equivalent of the partition > operation which gives the elements that matched and did not ma

Re: Filter operation to return two RDDs at once.

2015-06-02 Thread Sean Owen
In the sense here, Spark actually does have operations that make multiple RDDs like randomSplit. However there is not an equivalent of the partition operation which gives the elements that matched and did not match at once. On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang wrote: > As far as I k

Re: Filter operation to return two RDDs at once.

2015-06-02 Thread Jeff Zhang
As far as I know, Spark doesn't support multiple outputs. On Wed, Jun 3, 2015 at 2:15 PM, ayan guha wrote: > Why do you need to do that if filter and content of the resulting rdd are > exactly same? You may as well declare them as 1 RDD. > On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > >> I want to

Re: Filter operation to return two RDDs at once.

2015-06-02 Thread ayan guha
Why do you need to do that if filter and content of the resulting rdd are exactly same? You may as well declare them as 1 RDD. On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > I want to do this > > val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId > != NULL_VALUE) > > val

Filter operation to return two RDDs at once.

2015-06-02 Thread ๏̯͡๏
I want to do this val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE) val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE) This will run two different stages; can this be done in one stage? val (qtSessionsWithQt, guidU
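
A hedged sketch of the usual workaround discussed in the replies: cache the parent RDD and run both filters over it, since Spark has no single-pass operation that returns the matching and non-matching elements as two RDDs (assuming an existing SparkContext `sc`; the predicate is illustrative):

    val raw = sc.parallelize(1 to 100)
    raw.cache()                                    // avoid recomputing the parent for each filter
    val matched    = raw.filter(_ % 2 == 0)
    val notMatched = raw.filter(_ % 2 != 0)
    // randomSplit, by contrast, does produce multiple RDDs in one call, but splits randomly:
    val Array(a, b) = raw.randomSplit(Array(0.5, 0.5))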

Where does Spark persist RDDs on disk?

2015-05-05 Thread hquan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted i

Where does Spark persist RDDs on disk?

2015-05-05 Thread Haoliang Quan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed?

Re: Saving RDDs as custom output format

2015-04-14 Thread Akhil Das
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile Thanks Best Regards On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > Is it possible to store RDDs as custom output formats, For example ORC? > > Thanks, > Daniel >

Saving RDDs as custom output format

2015-04-14 Thread Daniel Haviv
Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

Re: SparkSQL - Caching RDDs

2015-04-01 Thread Michael Armbrust
What do you mean by "permanently"? If you start up the JDBC server and say CACHE TABLE it will stay cached as long as the server is running. CACHE TABLE is idempotent, so you could even just have that command in your BI tool's setup queries. On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam wrote:
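
A small sketch of the suggestion, issued from Scala rather than over JDBC for brevity (assuming an existing HiveContext `hiveContext` and an illustrative Hive table name). The table stays cached for the lifetime of the long-running context or Thrift server:

    hiveContext.sql("CACHE TABLE web_logs")        // idempotent; safe to put in a BI tool's setup queries
    val byStatus = hiveContext.sql("SELECT status, count(*) AS cnt FROM web_logs GROUP BY status")
    byStatus.show()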

SparkSQL - Caching RDDs

2015-04-01 Thread Venkat, Ankam
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool. Is there a way to cache the Hive table permanently in SparkSQL? I don't want to read the Hive table and cache it every time the query is submitted from the BI tool. Thanks! R

Re: can't union two rdds

2015-03-31 Thread ankurjain.nitrr
case. If that amount of data is small, you can use rdd.collect, then just iterate on both the lists and produce the desired result -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22323.html Sent from the Apache Spark User List mailing

Re: can't union two rdds

2015-03-31 Thread roy
use zip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22321.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e

Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
gt; > On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen wrote: > >> Hi Mark, >> >> That's true, but in neither way can I combine the RDDs, so I have to >> avoid unions. >> >> Thanks, >> Yang >> >> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
Yang Chen writes: > Hi Noorul, > > Thank you for your suggestion. I tried that, but ran out of memory. I did > some search and found some suggestions > that we should try to avoid rdd.union( > http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-

Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
Kelvin On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen wrote: > Hi Mark, > > That's true, but in neither way can I combine the RDDs, so I have to avoid > unions. > > Thanks, > Yang > > On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra > wrote: > >> RDD#union is

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Mark, That's true, but in neither way can I combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra wrote: > RDD#union is not the same thing as SparkContext#union > > On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen wrote:

Re: Combining Many RDDs

2015-03-26 Thread Mark Hamstra
d.union( > http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark > ). > I will try to come up with some other ways. > > Thank you, > Yang > > On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M > wrote: > >> s

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Noorul, Thank you for your suggestion. I tried that, but ran out of memory. I did some search and found some suggestions that we should try to avoid rdd.union( http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark ). I will try

Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
sparkx writes: > Hi, > > I have a Spark job and a dataset of 0.5 Million items. Each item performs > some sort of computation (joining a shared external dataset, if that does > matter) and produces an RDD containing 20-500 result items. Now I would like > to combine all these

Combining Many RDDs

2015-03-26 Thread sparkx
Hi, I have a Spark job and a dataset of 0.5 Million items. Each item performs some sort of computation (joining a shared external dataset, if that does matter) and produces an RDD containing 20-500 result items. Now I would like to combine all these RDDs and perform a next job. What I have found

Re: writing DStream RDDs to the same file

2015-03-26 Thread Akhil Das
"\n") fw.close() } }) Sending from cellphone, not sure how the code snippet will look. :) On 26 Mar 2015 01:20, "Adrian Mocanu" wrote: > Hi > > Is there a way to write all RDDs in a DStream to the same file? > > I tried this and got an empty file. I think

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code myStream.foreachRDD(rdd => {

RDD pair to pair of RDDs

2015-03-18 Thread Alex Turner (TMS)
What's the best way to go from: RDD[(A, B)] to (RDD[A], RDD[B]) If I do: def separate[A, B](k: RDD[(A, B)]) = (k.map(_._1), k.map(_._2)) Which is the obvious solution, this runs two maps in the cluster. Can I do some kind of a fold instead: def separate[A, B](l: List[(A, B)]) = l.foldLeft(Li

Re: order preservation with RDDs

2015-03-16 Thread kian.ho
For those still interested, I raised this issue on JIRA and received an official response: https://issues.apache.org/jira/browse/SPARK-6340 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052p22088.html Sent from the Apache

Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
this > issue whilst experimenting with feature extraction for text classification, > where (correct me if I'm wrong) there is no built-in mechanism to keep track > of document-ids through the HashingTF and IDF fitting and transformations. > > Thanks. > > > > -- > View

order preservation with RDDs

2015-03-14 Thread kian.ho
eservation-with-RDDs-tp22052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
