Re: Test coverage of Spark

2013-10-12 Thread Christopher Nguyen
Roman, an area I think would (a) have high impact, and (b) is relatively not well covered is performance analysis. I'm sure most teams are doing this internally at their respective companies, but there is no shared code base and shared wisdom about what we're finding/improving. For example, consid

Re: Test coverage of Spark

2013-10-12 Thread Christopher Nguyen
> > > On Sat, Oct 12, 2013 at 2:22 PM, Christopher Nguyen > wrote: > > > Roman, an area I think would (a) have high impact, and (b) is relatively > > not well covered is performance analysis. I'm sure most teams are doing > > this internally at their re

Re: MBrace: Cloud Computing with Monads

2013-10-23 Thread Christopher Nguyen
Re MBrace: very interesting work. I'm a bit surprised though that the paper makes no mention of DryadLINQ ( http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf). Architecturally it's a lot easier to see an MBrace implementation specialized to a MapReduce (or more generically, a BS

Fwd: MBrace: Cloud Computing with Monads

2013-10-23 Thread Christopher Nguyen
raphs can be modified dynamically based on > user code. It's also not clear what the granularity of task spawns in > MBrace is -- can you spawn stuff that runs for 1 millisecond, or 1 second, > or 1 hour? The choice there greatly affects system design. > > Matei > > On Oct 23

Re: Documentation of Java API and PySpark internals

2013-10-24 Thread Christopher Nguyen
+1 thanks, I learned a lot more about Spark's JavaAPI motivations from this documentation. We've run into many of these issues in our own mixed code base, particularly with implicit manifests and JVM type erasure. Sent while mobile. Pls excuse typos etc. On Oct 23, 2013 8:57 PM, "Josh Rosen" wrot

Re: a question about lineage graphs in streaming

2013-11-02 Thread Christopher Nguyen
Dachuan, you may have correctly answered your own question. See Fig. 3 of the same paper, where "infinity" occurs in the vertical direction. -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Sat, Nov 2, 2013 at 7:51 AM, dachuan wrote: > Hi, deve

Re: Spark development for undergraduate project

2013-12-17 Thread Christopher Nguyen
Matt, some suggestions. If you're interested in the machine-learning layer, perhaps you could look into helping to harmonize our (Adatao) dataframe representation with MLlib's, and base RDDs for that matter. It requires someone to spend some dedicated time looking into the trade-offs between gener

Re: Spark development for undergraduate project

2013-12-19 Thread Christopher Nguyen
+1 to most of Andrew's suggestions here, and while we're in that neighborhood, how about generalizing something like "wtf-spark" (from the Bizo team, http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of high academic interest, but it's something people would use many times a debugging day. Or a

Re: Large DataStructure to Broadcast

2013-12-25 Thread Christopher Nguyen
Purav, depending on the access pattern you should also consider the trade-offs of setting up a lookup service (using, e.g., memcached, egad!) which may end up being more efficient overall. The general point is not to restrict yourself to only Spark APIs when considering the overall architecture. -

Re: Option folding idiom

2013-12-26 Thread Christopher Nguyen
+1 as you can't fight the future, but clear warning signs ahead would be helpful :) Just be careful that it's not an exact equivalent to *match*, else we can get confused by behavior like this: *scala> class parent* *defined class parent* *scala> class child1 extends parent* *defined class child1*

Re: Option folding idiom

2013-12-27 Thread Christopher Nguyen
I've learned and unlearned enough things to be careful when claiming something is "more intuitive" than another, since it's subject to prior knowledge. When I first encountered map().getOrElse() it wasn't any more intuitive than this fold()() syntax. Maybe the "OrElse" helps a bit, but the "get" in
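For readers comparing the idioms named in this thread, here is a minimal sketch of the same Option handling written three ways. The object and method names are hypothetical and are not taken from the messages; only `Option.map`, `getOrElse`, `fold` and pattern matching are standard Scala.

```scala
// Illustrative only: OptionFoldSketch and its methods are made-up names.
object OptionFoldSketch {
  // map().getOrElse(): transform the value if present, then supply the default.
  def describeWithMap(port: Option[Int]): String =
    port.map(p => s"port $p").getOrElse("no port")

  // fold()(): the default comes first, the function for the Some case second.
  def describeWithFold(port: Option[Int]): String =
    port.fold("no port")(p => s"port $p")

  // match: both cases spelled out explicitly.
  def describeWithMatch(port: Option[Int]): String = port match {
    case Some(p) => s"port $p"
    case None    => "no port"
  }

  // fold takes its result type from the first argument list, so the default
  // is given an explicit element type here (List.empty[Int] rather than Nil).
  def toList(port: Option[Int]): List[Int] =
    port.fold(List.empty[Int])(p => List(p))
}
```

All three `describe*` methods return the same string for the same input. The `toList` case shows one place where `fold` is not a drop-in replacement for `match`: its result type is fixed by the default value instead of being widened across both branches.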

Re: Large DataStructure to Broadcast

2014-01-07 Thread Christopher Nguyen
the other threads using "synchronized" while one of them is loading > the large file for me. > Any suggestions what can that unique object specific to that particular JVM > be. Is SparkContext an option ? > > > > On Thu, Dec 26, 2013 at 10:41 AM, Christopher Nguyen
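One common way to get a single, lazily loaded structure per JVM in Scala is a singleton `object` holding a `lazy val`. The sketch below is an assumption about what could fit the quoted question, not the answer given in the thread. `LargeLookup`, `load` and the sample data are hypothetical stand-ins for the real file loader.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: a Scala object exists once per class loader, so every
// task running in the same JVM sees the same instance.
object LargeLookup {
  // Counts loader invocations, purely to make the load-once behaviour observable.
  val loadCount = new AtomicInteger(0)

  // Hypothetical stand-in for reading the large file from local storage.
  private def load(): Map[String, Int] = {
    loadCount.incrementAndGet()
    Map("a" -> 1, "b" -> 2)
  }

  // A lazy val is initialized on first access, at most once, with the
  // initialization guarded by a lock; later readers reuse the loaded value.
  lazy val table: Map[String, Int] = load()
}
```

Code running inside a task would read `LargeLookup.table` directly. The first access pays the load cost and later accesses in that JVM reuse the result, so no hand-written `synchronized` block is needed.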

Re: The performance of group operation on SSD

2014-01-17 Thread Christopher Nguyen
Chen, I would also look at actual I/O patterns of the operations. SSD writes can show significantly variable performance depending on the exact scenario, and can easily underperform HDD given the "right" conditions. Generically quoted IOPS numbers are not reliable across a variety of commo

Re: [DISCUSS] Graduating as a TLP

2014-01-23 Thread Christopher Nguyen
Cool, Matei! Sent while mobile. Pls excuse typos etc. On Jan 23, 2014 2:45 PM, "Matei Zaharia" wrote: > Hi folks, > > We’ve been working on the transition to Apache for a while, and our last > shepherd’s report says the following: > > > Spark > > Alan Cabrera (acabrera): > >

Re: [VOTE] Graduation of Apache Spark

2014-01-26 Thread Christopher Nguyen
+1 Sent while mobile. Pls excuse typos etc. On Jan 26, 2014 1:50 PM, "Matei Zaharia" wrote: > Hi guys, > > Discussion has proceeded positively, so I'm calling for a community VOTE > for the graduation of Apache Spark (incubating) into a top level project. > If this VOTE is successful, then I'll

Re: coding style discussion: explicit return type in public APIs

2014-02-18 Thread Christopher Nguyen
Reynold, perhaps better than enumerating all the rules, it should be generalized to a guideline that allows this relaxation when the return type is immediately obvious from the next few tokens in the code. In exceptional cases of doubt, provide the return type. The cases you've listed should serve

Re: coding style discussion: explicit return type in public APIs

2014-02-18 Thread Christopher Nguyen
Mridul, IIUC, what you've mentioned did come to mind, but I deemed it orthogonal to the stylistic issue Reynold is talking about. I believe you're referring to the case where there is a specific desired return type by API design, but the implementation does not, in which case, of course, one must

Re: coding style discussion: explicit return type in public APIs

2014-02-19 Thread Christopher Nguyen
>> def createFoo = new FooImpl() > >> > >> vs > >> > >> def createFoo: Foo = new FooImpl() > >> > >> Former will cause api instability. Reynold, maybe this is already > >> avoided - and I understood it wrong ? > >> > >
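The two definitions quoted above can be fleshed out into a small compilable sketch that shows why the inferred form exposes more than intended. `Foo`, `FooImpl` and `createFoo` come from the quoted message; the members `name` and `implOnly`, the `Factories` wrapper and the `createFooInferred` name are assumptions added for illustration.

```scala
// Illustrative only: the members below are invented to make the difference visible.
trait Foo {
  def name: String
}

class FooImpl extends Foo {
  def name: String = "foo"
  // Implementation detail that is not part of the Foo interface.
  def implOnly: Int = 42
}

object Factories {
  // No declared return type: the compiler infers FooImpl, so callers can
  // compile against implOnly and the public signature follows the implementation.
  def createFooInferred = new FooImpl()

  // Declared return type: callers see only Foo, whatever the body constructs.
  def createFoo: Foo = new FooImpl()
}
```

With these definitions `Factories.createFooInferred.implOnly` compiles, while `Factories.createFoo.implOnly` does not. Replacing `FooImpl` with another `Foo` implementation therefore changes the signature of the first method but not the second.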

Re: ASF Board Meeting Summary - February 19, 2014

2014-02-20 Thread Christopher Nguyen
Very cool, Andy! -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Thu, Feb 20, 2014 at 8:40 AM, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Sure does! Congrats guys! > > > > -Original Message- > From: Andy Konwinsk