Broadcast Memory Management

2017-09-20 Thread Matthias Boehm
Hi all, could someone please help me understand the broadcast life cycle in detail, especially with regard to memory management? After reading through the TorrentBroadcast implementation, it seems that for every broadcast object, the driver holds a strong reference to a shallow copy (in

Re: Total memory tracking: request for comments

2017-09-20 Thread Reynold Xin
Thanks. This is an important direction to explore and my apologies for the late reply. One thing that is really hard about this is that with different layers of abstraction, we often use other libraries that might allocate large amounts of memory (e.g. the snappy library, Parquet itself), which makes

Re: [discuss] Data Source V2 write path

2017-09-20 Thread Reynold Xin
On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan wrote:
> Hi all,
>
> I want to have some discussion about the Data Source V2 write path before
> starting a vote.
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1.

Re: New to dev community | Contribution to Mlib

2017-09-20 Thread Seth Hendrickson
I'm not exactly clear on what you're proposing, but this sounds like something that would live as a Spark package - a framework for anomaly detection built on Spark. If there is some specific algorithm you have in mind, it would be good to propose it on JIRA and discuss why you think it needs to

[discuss] Data Source V2 write path

2017-09-20 Thread Wenchen Fan
Hi all, I want to have some discussion about the Data Source V2 write path before starting a vote. The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful: 1. Exposing an upper-level API like DataFrame to the Data Source API is not good for maintenance. 2. Data