date:20160401

RE: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-01 Thread Raymond Honderdors

What about a separate branch for Scala 2.10? Sent from my Samsung Galaxy smartphone. Original message From: Koert Kuipers Date: 4/2/2016 02:10 (GMT+02:00) To: Michael Armbrust Cc: Matei Zaharia, Mark Hamstra, Cody Koeninger, Sean Owen, dev@spark.apache.org Subject: Re

RE: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Renyi Xiong

Thanks a lot, Sean, really appreciate your comments. Sent from my Windows 10 phone From: Sean Owen Sent: Friday, April 1, 2016 12:55 PM To: Renyi Xiong Cc: Tathagata Das; dev Subject: Re: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0 The change there was jus

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao

So I think a ramdisk is a simple thing to try. Besides, I think Reynold's suggestion is quite valid: with such a high-end machine, putting everything in memory might not improve performance as much as assumed, since the bottleneck will shift to things like memory bandwidth, NUMA, CPU efficiency (serialization-de

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch

Yes, we see it on the final write. Our preference is to eliminate this. On Fri, Apr 1, 2016, 7:25 PM Saisai Shao wrote: > Hi Michael, shuffle data (mapper output) has to be materialized to disk > in the end, no matter how much memory you have; that is by design in > Spark. In your scenario, s

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao

Hi Michael, shuffle data (mapper output) has to be materialized to disk in the end, no matter how much memory you have; that is by design in Spark. In your scenario, since you have a lot of memory, shuffle spill should not happen frequently, and most of the disk IO you see might be final shuffle fi

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-01 Thread Koert Kuipers

as long as we don't lock ourselves into supporting scala 2.10 for the entire spark 2 lifespan it sounds reasonable to me On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust wrote: > +1 to Matei's reasoning. > > On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia > wrote: > >> I agree that putting it i

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch

As I mentioned earlier, this flag is now ignored. On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch wrote: > Shuffling a 1 TB set of keys and values (aka sort by key) results in about > 500 GB of IO to disk if compression is enabled. Is there any way to > eliminate the IO caused by shuffling? > > On Fri, Ap

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin

It's spark.local.dir. On Fri, Apr 1, 2016 at 3:37 PM, Yong Zhang wrote: > Is there a configuration in Spark for the location of "shuffle spilling"? I > don't recall ever seeing one. Can you share it? > > It would be good to test writing to a RAM disk if that configuration is > available.
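For readers who want to try the RAM disk test discussed in this thread, here is a minimal sketch of passing `spark.local.dir` to a job through `spark-submit --conf`. The mount point `/mnt/ramdisk/spark`, the script name `my_job.py`, and the helper `spark_submit_command` are hypothetical names chosen for illustration; none of them come from the thread. The sketch only builds and prints the command line, it does not launch Spark.

```python
import shlex


def spark_submit_command(app, local_dirs, extra_conf=None):
    """Build a spark-submit argument list that sets spark.local.dir.

    local_dirs is a list of directories; Spark accepts a comma-separated
    list for this setting.
    """
    conf = {"spark.local.dir": ",".join(local_dirs)}
    conf.update(extra_conf or {})
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical tmpfs mount point; replace with a real RAM disk path.
RAMDISK = "/mnt/ramdisk/spark"

cmd = spark_submit_command("my_job.py", [RAMDISK])
print(shlex.join(cmd))
```

Depending on the cluster manager, this setting may be overridden by environment variables such as `SPARK_LOCAL_DIRS` or by YARN's local directories, so it is worth checking which directory the executors actually use before drawing conclusions from a test.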

RE: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Yong Zhang

Is there a configuration in Spark for the location of "shuffle spilling"? I don't recall ever seeing one. Can you share it? It would be good to test writing to a RAM disk if that configuration is available. Thanks Yong From: r...@databricks.com Date: Fri, 1 Apr 2016 15:32:23 -0700 Subject:

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch

Shuffling a 1 TB set of keys and values (aka sort by key) results in about 500 GB of IO to disk if compression is enabled. Is there any way to eliminate the IO caused by shuffling? On Fri, Apr 1, 2016, 6:32 PM Reynold Xin wrote: > Michael - I'm not sure if you actually read my email, but spill has > no

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin

Michael - I'm not sure if you actually read my email, but spill has nothing to do with the shuffle files on disk. It was for the partitioning (i.e. sorting) process. If that flag is off, Spark will just run out of memory when data doesn't fit in memory. On Fri, Apr 1, 2016 at 3:28 PM, Michael Sla

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch

A RAM disk is a fine interim step, but there are a lot of layers eliminated by keeping things in memory unless there is a need for spillover. At one time there was support for turning off spilling. That was eliminated. Why? On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan wrote: > I think Reynold's

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Mridul Muralidharan

I think Reynold's suggestion of using ram disk would be a good way to test if these are the bottlenecks or something else is. For most practical purposes, pointing local dir to ramdisk should effectively give you 'similar' performance as shuffling from memory. Are there concerns with taking that a

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin

If you work for a certain hardware vendor that builds expensive, high-performance nodes, and want to use Spark to demonstrate the performance gains of your great new systems, you will of course totally disagree. Anyway - I offered you a simple solution to work around the low-hanging fruit. Feel f

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin

Sure - feel free to totally disagree. On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch wrote: > I totally disagree that it’s not a problem. > > - Network fetch throughput on 40G Ethernet exceeds the throughput of NVMe > drives. > - What Spark is depending on is Linux’s IO cache as an effective

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch

I totally disagree that it’s not a problem. - Network fetch throughput on 40G Ethernet exceeds the throughput of NVMe drives. - What Spark is depending on is Linux’s IO cache as an effective buffer pool. This is fine for small jobs but not for jobs with datasets in the TB/node range. - On larger
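The disagreement in this thread about whether network or disk is the bottleneck comes down to numbers that vary by hardware. The following is a back-of-the-envelope sketch, not a benchmark: it converts a link speed in Gbit/s to bytes per second (raw line rate, ignoring protocol overhead) and compares it with a drive throughput. The drive figure of 2.5 GB/s is a placeholder assumption, not a value from the thread; the 500 GB shuffle size is the figure quoted elsewhere in the thread.

```python
def link_bytes_per_sec(gigabits_per_sec):
    """Raw line rate of a network link in bytes per second (no overhead)."""
    return gigabits_per_sec * 1e9 / 8


def seconds_to_move(num_bytes, bytes_per_sec):
    """Time to move num_bytes at a sustained rate of bytes_per_sec."""
    return num_bytes / bytes_per_sec


ETHERNET_GBITS = 40                   # link speed named in the thread
ASSUMED_DRIVE_BYTES_PER_SEC = 2.5e9   # placeholder assumption; measure your own drives
SHUFFLE_BYTES = 500e9                 # 500 GB of shuffle IO, as quoted in the thread

network_rate = link_bytes_per_sec(ETHERNET_GBITS)
print("network line rate (GB/s):", network_rate / 1e9)
print("seconds over network:", seconds_to_move(SHUFFLE_BYTES, network_rate))
print("seconds on assumed drive:", seconds_to_move(SHUFFLE_BYTES, ASSUMED_DRIVE_BYTES_PER_SEC))
```

Substituting measured sustained throughput for both the link and the drives (including multiple drives per node, if present) shows which side of the argument applies to a given cluster.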

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin

spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to not write shuffle files to disk, and typically it is not a problem because the network fetch throughput is lower than what disks can sustain. In most cases, especially with SS

Re: What influences the space complexity of Spark operations?

2016-04-01 Thread Michael Armbrust

Blocking operators like Sort, Join or Aggregate will put all of the data for a whole partition into a hash table or array. However, if you are running Spark 1.5+ we should be spilling to disk. In Spark 1.6 if you are seeing OOMs for SQL operations you should report it as a bug. On Thu, Mar 31, 2

Re: Spark SQL UDF Returning Rows

2016-04-01 Thread Michael Armbrust

> > I haven't looked at Encoders or Datasets since we're bound to 1.6 for now > but I'll look at encoders to see if that covers it. Datasets seems like it > would solve this problem for sure. > There is an experimental preview of Datasets in Spark 1.6 > I avoided returning a case object because

Re: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Sean Owen

The change there was just to mark the methods non-experimental. The logic was that they'd been around for many releases without change, and are unlikely to be changed now that they've been in the wild so long, so already acted as if they're part of the normal stable API. Are they important? I pers

Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch

Hello; I’m working on Spark with very large memory systems (2TB+) and notice that Spark spills to disk in shuffle. Is there a way to force Spark to stay in memory when doing shuffle operations? The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x) and never

Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Renyi Xiong

Hi Sean, We're upgrading Mobius (C# binding for Spark) in Microsoft to align with Spark 1.6.2 and noticed some changes in API you did in https://github.com/apache/spark/commit/6f81eae24f83df51a99d4bb2629dd7daadc01519 mostly on APIs with Approx postfix. (still marked as experimental in pyspark t

Re: how about a custom coalesce() policy?

2016-04-01 Thread Nezih Yigitbasi

Hey Reynold, Created an issue (and a PR) for this change to get discussions started. Thanks, Nezih On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin wrote: > Using the right email for Nezih > > > On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin wrote: > >> I think this can be useful. >> >> The only th

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Ricardo Almeida

Amazing! I'll fund $1/2 million for such an interesting initiative. Oh, wait... I have only $4 in my pocket Cheers :) On 1 April 2016 at 11:40, Takeshi Yamamuro wrote: > Oh, the annual event... > > On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li wrote: > >> April 1st... : ) >> >> 2016-04-01 0:33 GMT-07

Re: Any documentation on Spark's security model beyond YARN?

2016-04-01 Thread Michael Segel

Guys, Getting a bit off topic. Saying Security and HBase in the same sentence is a bit of a joke until HBase rejiggers its coprocessors. Although Andrew’s fix could be enough to keep CSOs and their minions happy. The larger picture is that Security has to stop being a ‘second thought’.

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Takeshi Yamamuro

Oh, the annual event... On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li wrote: > April 1st... : ) > > 2016-04-01 0:33 GMT-07:00 Michael Malak : > >> I see you've been burning the midnight oil. >> >> >> -- >> *From:* Reynold Xin >> *To:* "dev@spark.apache.org" >> *Sent:* Fri

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Xiao Li

April 1st... : ) 2016-04-01 0:33 GMT-07:00 Michael Malak : > I see you've been burning the midnight oil. > > > -- > *From:* Reynold Xin > *To:* "dev@spark.apache.org" > *Sent:* Friday, April 1, 2016 1:15 AM > *Subject:* [discuss] using deep learning to improve Spark

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Michael Malak

I see you've been burning the midnight oil. From: Reynold Xin To: "dev@spark.apache.org" Sent: Friday, April 1, 2016 1:15 AM Subject: [discuss] using deep learning to improve Spark Hi all, Hope you all enjoyed the Tesla 3 unveiling earlier tonight. I'd like to bring your attention

[discuss] using deep learning to improve Spark

2016-04-01 Thread Reynold Xin

Hi all, Hope you all enjoyed the Tesla 3 unveiling earlier tonight. I'd like to bring your attention to a project called DeepSpark that we have been working on for the past three years. We realized that scaling software development was challenging. A large fraction of software engineering has bee


29 matches
