Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Evan R. Sparks
...would take longer to transfer the local gradient vectors in that level, since they are dense in every level. Furthermore, the driver is receiving the result of only 4 tasks, which is relatively small. -- Mike
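
The exchange above concerns aggregating dense per-partition gradients with treeAggregate. A minimal sketch of that pattern (the feature dimension, weight vector, and least-squares gradient below are illustrative assumptions, not code from the thread):

    import breeze.linalg._
    import org.apache.spark.rdd.RDD

    val dim = 1000                                        // assumed feature dimension
    val w = DenseVector.zeros[Double](dim)                // assumed current weights
    val data: RDD[(Double, DenseVector[Double])] = ???    // (label, features), assumed input

    // treeAggregate merges partial sums on intermediate executors, so the
    // driver only receives a handful of partially combined dense gradients
    // rather than one result per partition.
    val gradientSum = data.treeAggregate(DenseVector.zeros[Double](dim))(
      seqOp = (acc, point) => {
        val (label, features) = point
        val residual = (w dot features) - label           // least-squares residual
        acc += features * residual                        // accumulate dense gradient in place
      },
      combOp = (g1, g2) => g1 += g2,
      depth = 2                                           // two-level combine tree
    )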

Re: RDD API patterns

2015-09-26 Thread Evan R. Sparks
Mike, I believe the reason you're seeing near-identical performance on the gradient computations is twofold: 1) Gradient computations for GLM models are computationally pretty cheap from a FLOPs/byte-read perspective. They are essentially a BLAS "gemv" call in the dense case, which is well known
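
As a rough illustration of the "gemv" point (the sizes and the least-squares loss below are assumptions, not from the thread), a dense-batch GLM gradient reduces to two matrix-vector products, so the arithmetic per byte read is small:

    import breeze.linalg.{DenseMatrix, DenseVector}

    val n = 10000; val d = 100                 // assumed batch size and feature count
    val X = DenseMatrix.rand(n, d)             // dense feature matrix
    val y = DenseVector.rand(n)                // labels
    val w = DenseVector.zeros[Double](d)       // weights

    // Least-squares gradient X^T (X w - y): two BLAS gemv calls, O(n*d) flops
    // over O(n*d) doubles read, i.e. memory-bound rather than compute-bound.
    val residual = X * w - y
    val gradient = X.t * residual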

Re: Scan Sharing in Spark

2015-05-05 Thread Evan R. Sparks
Scan sharing can indeed be a useful optimization in Spark, because you amortize not only the time spent scanning over the data, but also the time spent in task launch and scheduling overheads. Here's a trivial example in Scala. I'm not aware of a place in SparkSQL where this is used - I'd imagine
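
The trivial Scala example referred to above is cut off in this preview; a minimal sketch of the scan-sharing idea (variable names made up, `sc` is an existing SparkContext) might look like:

    val nums: org.apache.spark.rdd.RDD[Double] =
      sc.parallelize(1 to 1000000).map(_.toDouble)

    // Three statistics as three separate actions: three jobs, three full scans.
    val count = nums.count()
    val sum   = nums.sum()
    val max   = nums.max()

    // Scan sharing: one aggregate computes all three in a single pass,
    // amortizing the scan plus the task-launch/scheduling overhead.
    val (cnt, total, mx) = nums.aggregate((0L, 0.0, Double.MinValue))(
      (acc, x) => (acc._1 + 1, acc._2 + x, math.max(acc._3, x)),
      (a, b)   => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))
    )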

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Evan R. Sparks
In general there's a tension between ordered data and the set-oriented data model underlying DataFrames. You can force a total ordering on the data, but it may come at a high cost with respect to performance. It would be good to get a sense of the use case you're trying to support, but one suggestion
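
One way to force that total ordering (sketched here as an assumption about what such a suggestion might involve, not the thread's actual answer, since the preview is truncated) is to sort, index, and self-join, which makes the cost of a Pandas-style shift explicit:

    import org.apache.spark.rdd.RDD

    val values: RDD[(Long, Double)] = ???                // (timestamp, value), assumed input
    // Impose a total order and attach a row index (a full sort plus an extra pass).
    val indexed = values.sortByKey().values.zipWithIndex().map(_.swap)   // (rowIndex, value)
    // Shift by one: re-key each row to index + 1, then join current with previous.
    val shifted = indexed.map { case (i, v) => (i + 1, v) }
    val withLag = indexed.join(shifted)                  // (rowIndex, (current, previous))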

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input; you couldn't pass it anything like an InputStream). I don't know if it has gotten any better. Parquet plays much more nicely
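
For contrast with HDF5, a sketch of the Parquet route via SparkSQL (the path and tiny dataset are made up; the method names are the Spark 1.3-era ones, later spelled df.write.parquet / sqlContext.read.parquet):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.{Row, SQLContext}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
      LabeledPoint(0.0, Vectors.dense(0.3, 0.4))))

    // Write: the Vector column round-trips through Parquet via its SQL UDT.
    points.toDF().saveAsParquetFile("/tmp/points.parquet")

    // Read back into MLlib's LabeledPoint.
    val loaded = sqlContext.parquetFile("/tmp/points.parquet").map {
      case Row(label: Double, features: Vector) => LabeledPoint(label, features)
    }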

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
...and not to Fortran BLAS. Best regards, Alexander

Re: ideas for MLlib development

2015-03-03 Thread Evan R. Sparks
Hi Robert, There's some work to do LDA via Gibbs sampling in this JIRA: https://issues.apache.org/jira/browse/SPARK-1405 as well as this one: https://issues.apache.org/jira/browse/SPARK-5556 It may make sense to have a more general Gibbs sampling framework, but it might be good to have a few

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks
...Joseph Bradley wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019 ... Evan R. Sparks wrote: Thanks for compiling all

Re: Using CUDA within Spark / boosting linear algebra

2015-02-25 Thread Evan R. Sparks
One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?

Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Evan R. Sparks
Josh - thanks for the detailed write-up - this seems a little funny to me. I agree that with the current code path there is more work being done than needs to be (e.g. the features are re-scaled at every iteration, but the relatively costly process of fitting the StandardScaler should not be
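
A minimal sketch of the fix implied above (names are illustrative): fit the StandardScaler once up front and let the iterations reuse the already-scaled, cached data:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val data: RDD[LabeledPoint] = ???                      // assumed training data

    // One pass to fit the scaler, done exactly once.
    val scalerModel = new StandardScaler(withMean = false, withStd = true)
      .fit(data.map(_.features))

    // Scale once and cache; every optimizer iteration then reads the scaled copy.
    val scaled = data
      .map(p => LabeledPoint(p.label, scalerModel.transform(p.features)))
      .cache()
    // ... run the iterative solver over `scaled` ...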

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Evan R. Sparks
Well, you can always join as many RDDs as you want by chaining them together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of RDDs in this way but 10 is probably doable. That said - SparkSQL has an optimizer under the covers that can make clever decisions e.g. pushing the
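
For concreteness, a small sketch of the chained-join pattern (key and value types are made up):

    import org.apache.spark.rdd.RDD

    val a: RDD[(Int, String)] = ???
    val b: RDD[(Int, String)] = ???
    val c: RDD[(Int, String)] = ???

    // Each join keys on the first element; the value type nests one level per join.
    val joined: RDD[(Int, ((String, String), String))] = a.join(b).join(c)
    // Workable for ~10 inputs; with SparkSQL the optimizer can instead pick a
    // join order and push work down when the same query is written over tables.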

Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Evan R. Sparks
...I would build OpenBLAS yourself, since good BLAS

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
...Getting breeze to pick up the right BLAS library
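
A quick runtime check of which implementation was actually bound (this assumes the netlib-java that breeze and Spark use; NativeSystemBLAS means a system library such as OpenBLAS/MKL was found, NativeRefBLAS the bundled reference build, F2jBLAS the pure-Java fallback):

    // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a native
    // system BLAS was found on java.library.path.
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
    println(com.github.fommil.netlib.LAPACK.getInstance().getClass.getName)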

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
...Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data

Spark SQL Window Functions

2015-02-08 Thread Evan R. Sparks
Currently there's no standard way of handling time series data in Spark. We were kicking around some ideas in the lab today, and one thing that came up was SQL window functions as a way to support them and to query over time series (do things like moving averages, etc.). These don't seem to be
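
For context, the kind of query being proposed: window functions did not exist in Spark SQL at the time of this thread, so the sketch below uses the DataFrame API that later shipped in Spark 1.4, over a made-up readings table:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val readings: DataFrame = ???            // assumed columns: ts (timestamp), value (double)

    // 7-row trailing moving average over the time series. Note: an ORDER BY
    // without a PARTITION BY pulls everything into a single partition.
    val w = Window.orderBy("ts").rowsBetween(-6, 0)
    val withMovingAvg = readings.withColumn("moving_avg_7", avg(col("value")).over(w))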

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
...on how these all might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don't you take BIDMach for optimization and learning? Best regards, Alexander

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases. You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Evan R. Sparks
I'm +1 on this, although a little worried about unknowingly introducing SparkSQL dependencies every time someone wants to use this. It would be great if the interface can be abstract and the implementation (in this case, SparkSQL backend) could be swapped out. One alternative suggestion on the

Re: Notes on writing complex spark applications

2014-11-24 Thread Evan R. Sparks
...Sam Bessalah wrote: Thanks Evan, this is great. ... Evan R. Sparks wrote: Hi all, Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been working on a short document about

Notes on writing complex spark applications

2014-11-23 Thread Evan R. Sparks
Hi all, Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been working on a short document about writing high performance Spark applications based on our experience developing MLlib, GraphX, ml-matrix, pipelines, etc. It may be a useful document both for users and new Spark

Re: Gaussian Mixture Model clustering

2014-09-19 Thread Evan R. Sparks
Hey Meethu - what are you setting K to in the benchmarks you show? This can greatly affect the runtime. On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi all, Please find attached the image of benchmark results. The table in the previous mail got messed up.

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Evan R. Sparks
There's some work on this going on in the AMP Lab. Create a ticket and we can update with our progress so that we don't duplicate effort. On Fri, Sep 5, 2014 at 8:18 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: Hi RJ, Thank you for your comment. I am interested in having other matrix

Re: Is breeze thread safe in Spark?

2014-09-03 Thread Evan R. Sparks
Additionally, at a higher level, MLlib allocates separate Breeze Vectors/Matrices on a per-executor basis. The only place I can think of where data structures might be overwritten concurrently is in a .aggregate() call, and these calls happen sequentially. RJ - Do you have a JIRA reference for

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Evan R. Sparks
As Sean mentions, if you can change the data to the standard format, that's probably a good idea. If you'd rather read the data raw, you could write your own version of loadLibSVMFile - a loader function very similar to the existing one, with a few characters
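
A minimal sketch of such a loader for zero-based "label index:value ..." lines (the function name, whitespace handling, and fixed numFeatures argument are all assumptions; unlike MLUtils.loadLibSVMFile it does not subtract 1 from the indices, and it skips error handling and dimension inference):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    def loadZeroIndexedLibSVM(sc: SparkContext, path: String, numFeatures: Int) =
      sc.textFile(path).map(_.trim).filter(_.nonEmpty).map { line =>
        val parts = line.split("\\s+")
        val label = parts.head.toDouble
        val (indices, values) = parts.tail.map { item =>
          val Array(i, v) = item.split(":")
          (i.toInt, v.toDouble)                 // indices kept as-is: zero-based
        }.unzip
        LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
      }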

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass

Re: Contributing to MLlib

2014-07-02 Thread Evan R. Sparks
Hi there, Generally we try to avoid duplicating logic if possible, particularly for algorithms that are structurally very similar. See, for example, the way we implement logistic regression vs. linear regression vs. linear SVM with different gradient functions, all on top of SGD or
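
As a small illustration of that shared structure (the inputs are toy values; these Gradient classes are MLlib developer APIs), the per-example gradient objects all satisfy the same compute(features, label, weights) contract, which is what lets the GLMs share a single optimizer implementation:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{HingeGradient, LeastSquaresGradient, LogisticGradient}

    val features = Vectors.dense(1.0, 2.0)
    val weights  = Vectors.dense(0.5, -0.25)
    val label    = 1.0

    // Same call shape, different loss: each returns (gradient, loss) for one example.
    val (logGrad,   logLoss)   = new LogisticGradient().compute(features, label, weights)
    val (lsqGrad,   lsqLoss)   = new LeastSquaresGradient().compute(features, label, weights)
    val (hingeGrad, hingeLoss) = new HingeGradient().compute(features, label, weights)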

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Evan R. Sparks
While DBSCAN and others would be welcome contributions, I couldn't agree more with Sean. On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen so...@cloudera.com wrote: Nobody asked me, and this is a comment on a broader question, not this one, but: In light of a number of recent items about adding