Re: taking an n number of rows from an RDD starting from an index

2015-09-01 Thread Hemant Bhanawat
I think rdd.toLocalIterator is what you want, but note that it keeps one partition's data in memory at a time. On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera wrote: > Hi all, > > I have a large set of data which would not fit into the memory. So, I want > to take n number of data from the RDD given a particular
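
The suggestion above can be illustrated with a minimal sketch in plain Python. It assumes the rows arrive as a lazy iterator (in PySpark, `rdd.toLocalIterator()` returns one) and that the index is zero-based. The helper name `take_from` and the generator `fake_rows` are hypothetical and only stand in for real data; this is not code from the thread.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take_from(rows: Iterable[T], start: int, n: int) -> List[T]:
    """Return up to n rows beginning at zero-based position start.

    The input is consumed lazily, so only the returned window is
    materialised here; rows before `start` are read and discarded.
    """
    if start < 0 or n < 0:
        raise ValueError("start and n must be non-negative")
    return list(islice(rows, start, start + n))


def fake_rows(total: int) -> Iterator[int]:
    # Hypothetical stand-in for rdd.toLocalIterator(): yields rows one by one.
    for i in range(total):
        yield i


# Take 1000 rows starting from index 1001, as in the original question.
window = take_from(fake_rows(10_000), 1001, 1000)
```

With a real RDD the skipped rows are still fetched to the driver partition by partition, so this approach costs more the larger the starting index is.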

OOM in spark driver

2015-09-01 Thread ankit tyagi
Hi all, I am using spark-sql 1.3.1 with Hadoop 2.4.0. I am running a SQL query against Parquet files and wanted to save the result to S3, but it looks like the https://issues.apache.org/jira/browse/SPARK-2984 problem still occurs while saving data to S3. Hence I am now saving the result to HDFS and with t

taking an n number of rows from an RDD starting from an index

2015-09-01 Thread Niranda Perera
Hi all, I have a large data set that does not fit into memory, so I want to take n rows from the RDD starting at a particular index: for example, 1000 rows starting from index 1001. I see that there is a take(num: Int): Array[T] method in the RDD, but it only returns the

[ compress in-memory column storage used in sparksql cache table ]

2015-09-01 Thread Wangchangchun (A)
Hi, I have an idea; can someone give me some advice? I want to compress the data in the in-memory column storage used by cache table in Spark, so that cached tables use less memory. I will add a conf for this feature, so anyone who wants to use it can set this conf to t

Re: Tungsten off heap memory access for C++ libraries

2015-09-01 Thread Paul Weiss
https://issues.apache.org/jira/browse/SPARK-10399 Is the jira to track. On Sep 1, 2015 5:32 PM, "Paul Wais" wrote: > Paul: I've worked on running C++ code on Spark at scale before (via JNA, > ~200 > cores) and am working on something more contribution-oriented now (via > JNI). > A few comments:

Use of UnsafeRow

2015-09-01 Thread Ulanov, Alexander
Dear Spark developers, Could you suggest what the intended use of UnsafeRow is (except for Tungsten groupBy and sort) and give an example of how to use it? 1) Is it intended to be instantiated as a copy of the Row in order to perform in-place modifications of it? 2) Can I create a new UnsafeRow giv

Re: Tungsten off heap memory access for C++ libraries

2015-09-01 Thread Paul Wais
Paul: I've worked on running C++ code on Spark at scale before (via JNA, ~200 cores) and am working on something more contribution-oriented now (via JNI). A few comments: * If you need something *today*, try JNA. It can be slow (e.g. a short native function in a tight loop) but works if you have

[VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-01 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ... To

Resource allocation in SPARK streaming

2015-09-01 Thread anshu shukla
I am not very clear about resource allocation (CPU/core/thread-level allocation) in relation to the parallelism obtained by setting the number of cores in Spark standalone mode. Are there any guidelines for that? -- Thanks & Regards, Anshu Shukla

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread Chester Chen
Thanks Sean, that makes it clear. On Tue, Sep 1, 2015 at 7:17 AM, Sean Owen wrote: > Any 1.5 RC comes from the latest state of the 1.5 branch at some point > in time. The next RC will be cut from whatever the latest commit is. > You can see the tags in git for the specific commits for each RC. >

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread Sean Owen
Any 1.5 RC comes from the latest state of the 1.5 branch at some point in time. The next RC will be cut from whatever the latest commit is. You can see the tags in git for the specific commits for each RC. There's no such thing as "1.5.1 SNAPSHOT" commits, just commits to branch 1.5. I would ignore

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread chester
Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I assume it would be cut from the 1.5 branch; doesn't that bring in 1.5.1 snapshot code? The reason I am asking is that I would like to know which commit I should use if I want to build 1.5.0 myself. Sent from my iPad >

[SparkR] lint script for SparkR

2015-09-01 Thread Yu Ishikawa
Hi all, Shivaram and I added a lint script for SparkR, `dev/lint-r`, and it is already running on Jenkins. If there are any validation problems in your patch, Jenkins will fail. Could you please make sure that your patch doesn't have any validation problems on your local machine before

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread Sean Owen
The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is released. In practice I think it's simpler to follow the defaults of the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after any 1.5.0-rc is relea

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread chester
Sorry, I still don't follow. I assume the release would be built from 1.5.0 before moving to 1.5.1. Are you saying 1.5.0 rc3 could be built from the 1.5.1 snapshot during release? Or would 1.5.0 rc3 be built from the last commit of 1.5.0 (before changing to 1.5.1 snapshot)? Sent from my iPad > On

Re: Tungsten off heap memory access for C++ libraries

2015-09-01 Thread Reynold Xin
Please do. Thanks. On Mon, Aug 31, 2015 at 5:00 AM, Paul Weiss wrote: > Sounds good, want me to create a jira and link it to SPARK-9697? Will put > down some ideas to start. > On Aug 31, 2015 4:14 AM, "Reynold Xin" wrote: > >> BTW if you are interested in this, we could definitely get some help

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-09-01 Thread Sean Owen
That's correct for the 1.5 branch, right? This doesn't mean that the next RC would have this value. You choose the release version during the release process. On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen wrote: > Seems that Github branch-1.5 already changing the version to 1.5.1-SNAPSHOT, > > I a