ld take longer to transfer the local gradient vectors
> in that level, since they are dense in every level. Furthermore, the
> driver is receiving the result of only 4 tasks, which is relatively
> small.
>
> Mike
>
>
> On 9/26/15, Evan R. Sparks <evan.spa...@gmail.com>
Mike,
I believe the reason you're seeing near-identical performance on the
gradient computations is twofold:
1) Gradient computations for GLM models are computationally pretty cheap
from a FLOPs-per-byte-read perspective. They are essentially a BLAS "gemv" call
in the dense case, which is well known
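The "gemv" observation can be made concrete with a small sketch. This is a pure-Python stand-in for the dense GLM gradient (here, squared loss), not MLlib's code; a real implementation dispatches the same shape of computation to optimized BLAS:

```python
# The dense GLM gradient is essentially a matrix-vector product
# (BLAS "gemv"): grad = X^T (X w - y) for squared loss.
# Pure-Python stand-in; real code would call optimized BLAS.

def gemv(A, x):
    """y = A x for a dense matrix A (list of rows) and vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def glm_gradient(X, y, w):
    residual = [p - t for p, t in zip(gemv(X, w), y)]   # X w - y
    Xt = list(map(list, zip(*X)))                        # transpose of X
    return gemv(Xt, residual)                            # X^T (X w - y)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
w = [0.0, 0.0]
print(glm_gradient(X, y, w))  # [-4.0, -5.0]
```

Each data point contributes one multiply-add per feature, which is why the computation is cheap relative to the bytes read.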
Scan sharing can indeed be a useful optimization in Spark, because you
amortize not only the time spent scanning over the data, but also the time
spent in task launch and scheduling overheads.
Here's a trivial example in scala. I'm not aware of a place in SparkSQL
where this is used - I'd imagine
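The Scala example referenced above did not survive the archive; as a rough stand-in, here is a plain-Python sketch of scan sharing (illustrative only, not Spark code): several aggregations are folded over the data in a single pass instead of one scan per aggregate.

```python
# Sketch of scan sharing: evaluate several aggregations in a single
# pass over the data, amortizing the scan (and, in Spark, the
# task-launch/scheduling overhead) across all of them.

def shared_scan(data, aggregators):
    """aggregators: list of (zero, seq_op) pairs; one pass updates all."""
    accs = [zero for zero, _ in aggregators]
    for x in data:                      # single scan over the data
        for i, (_, seq_op) in enumerate(aggregators):
            accs[i] = seq_op(accs[i], x)
    return accs

data = range(1, 6)
total, count, maximum = shared_scan(
    data,
    [(0, lambda acc, x: acc + x),       # sum
     (0, lambda acc, x: acc + 1),       # count
     (float("-inf"), max)],             # max
)
print(total, count, maximum)  # 15 5 5
```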
In general there's a tension between ordered data and the set-oriented data
model underlying DataFrames. You can force a total ordering on the data,
but it may come at a high cost in performance.
It would be good to get a sense of the use case you're trying to support,
but one suggestion
On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.
Parquet plays much more nicely
and not to Fortran BLAS.
Best regards, Alexander
-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra
Hi,
I
Hi Robert,
There's some work to do LDA via Gibbs sampling in this JIRA:
https://issues.apache.org/jira/browse/SPARK-1405 as well as this one:
https://issues.apache.org/jira/browse/SPARK-5556
It may make sense to have a more general Gibbs sampling framework, but it
might be good to have a few
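As background on the technique itself, here is a minimal Gibbs sampler for a toy target (a bivariate normal with correlation rho). This is purely illustrative and unrelated to the implementations in those JIRAs:

```python
import random

# Gibbs sampling alternates draws from each variable's conditional
# distribution. For a standard bivariate normal with correlation rho:
#   x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    sd = (1.0 - rho * rho) ** 0.5
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)   # draw x | y
        y = rng.gauss(rho * x, sd)   # draw y | x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
mean_x = sum(x for x, _ in samples) / len(samples)
```

A general framework would abstract exactly the per-variable conditional draws that the inner loop hard-codes here.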
at 3:36 PM, Joseph Bradley jos...@databricks.com
wrote:
Better documentation for linking would be very helpful! Here's a JIRA:
https://issues.apache.org/jira/browse/SPARK-6019
On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Thanks for compiling all
One thing still needs exploration: does BIDMat-cublas copy data to/from the
machine's RAM?
-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, February 10, 2015 2:12 PM
To: Evan R. Sparks
Cc: Joseph Bradley; dev@spark.apache.org
Subject: RE: Using CUDA within Spark
Josh - thanks for the detailed write-up - this seems a little funny to me.
I agree that with the current code path there is extra work being done beyond
what is needed (e.g. the features are re-scaled at every iteration, but the
relatively costly process of fitting the StandardScaler should not be
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
RDDs in this way but 10 is probably doable.
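The chaining pattern can be sketched with plain dicts standing in for pair RDDs (illustrative only; RDD.join keeps keys present on both sides and nests the values the same way):

```python
from functools import reduce

# Stand-in for a.join(b).join(c) on pair RDDs: each join keeps keys
# present on both sides and pairs up the values, so chained joins
# produce nested tuples on the value side.

def join(left, right):
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}

a = {1: "a1", 2: "a2", 3: "a3"}
b = {1: "b1", 2: "b2"}
c = {2: "c2", 3: "c3"}

result = reduce(join, [a, b, c])  # a.join(b).join(c)
print(result)  # {2: (('a2', 'b2'), 'c2')}
```

Note how only keys surviving every join remain, and how the nesting deepens with each step; that nesting is part of why long join chains get unwieldy.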
That said - SparkSQL has an optimizer under the covers that can make clever
decisions e.g. pushing the
*From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
*Sent:* Friday, February 06, 2015 5:58 PM
*To:* Ulanov, Alexander
*Cc:* Joseph Bradley; dev@spark.apache.org
*Subject:* Re: Using CUDA within Spark / boosting linear algebra
I would build OpenBLAS yourself, since good BLAS
it.
*From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
*Sent:* Friday, February 06, 2015 5:19 PM
*To:* Ulanov, Alexander
*Cc:* Joseph Bradley; dev@spark.apache.org
*Subject:* Re: Using CUDA within Spark / boosting linear algebra
Getting breeze to pick up the right blas library
: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
Hi Alexander,
Using GPUs with Spark would be very exciting. Small comment: Concerning
your question earlier about keeping data
Currently there's no standard way of handling time series data in Spark. We
were kicking around some ideas in the lab today and one thing that came up
was SQL Window Functions as a way to support them and query over time
series (do things like moving average, etc.)
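A windowed moving average, the kind of per-row computation a SQL window clause such as `AVG(...) OVER (ROWS 2 PRECEDING)` expresses over ordered data, can be sketched in plain Python (a stand-in, not Spark SQL):

```python
from collections import deque

# Moving average over a trailing window of ordered time-series values:
# each output row aggregates the current value and up to window-1
# preceding values, as a ROWS-based SQL window function would.

def moving_average(series, window):
    buf, out = deque(maxlen=window), []
    for v in series:
        buf.append(v)                    # deque drops the oldest value
        out.append(sum(buf) / len(buf))  # average over current window
    return out

print(moving_average([1, 2, 3, 4, 5], window=3))
# [1.0, 1.5, 2.0, 3.0, 4.0]
```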
These don't seem to be
on how these all might be
connected with Spark MLlib? If you take BIDMat for linear algebra, why don't
you take BIDMach for optimization and learning?
Best regards, Alexander
*From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
*Sent:* Thursday, February 05, 2015 12:09 PM
*To:* Ulanov
I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
many cases.
You might consider taking a look at the codepaths that BIDMat (
https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
to make
I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.
One alternative suggestion on the
2014-11-24 2:17 GMT+09:00 Sam Bessalah samkiller@gmail.com:
Thanks Evan, this is great.
On Nov 23, 2014 5:58 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Hi all,
Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
working on a short document about writing high performance Spark
applications based on our experience developing MLlib, GraphX, ml-matrix,
pipelines, etc. It may be a useful document both for users and new Spark
Hey Meethu - what are you setting K to in the benchmarks you show? This
can greatly affect the runtime.
On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew meethu.mat...@flytxt.com
wrote:
Hi all,
Please find attached the image of benchmark results. The table in the
previous mail got messed up.
There's some work on this going on in the AMP Lab. Create a ticket and we
can update with our progress so that we don't duplicate effort.
On Fri, Sep 5, 2014 at 8:18 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
wrote:
Hi RJ,
Thank you for your comment. I am interested in having other matrix
Additionally, at the higher level, MLlib allocates separate Breeze
vectors/matrices on a per-executor basis. The only place I can think of
where data structures might be over-written concurrently is in an
.aggregate() call, and those calls happen sequentially.
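A plain-Python sketch of the .aggregate() seqOp/combOp contract shows why per-partition accumulators are not shared (a stand-in for illustration, not Spark's implementation):

```python
from copy import deepcopy

# RDD.aggregate semantics: each partition folds its elements into its
# OWN copy of the zero value with seq_op, then the per-partition
# results are merged with comb_op. Accumulators are never shared
# across partitions, so there is no concurrent over-writing.

def aggregate(partitions, zero, seq_op, comb_op):
    partials = []
    for part in partitions:
        acc = deepcopy(zero)             # fresh accumulator per partition
        for x in part:
            acc = seq_op(acc, x)
        partials.append(acc)
    result = deepcopy(zero)
    for p in partials:
        result = comb_op(result, p)      # merge partition results
    return result

# Sum and count in one aggregate, as MLlib-style code often does.
total, count = aggregate(
    [[1, 2, 3], [4, 5]],                 # two "partitions"
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(total, count)  # 15 5
```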
RJ - Do you have a JIRA reference for
As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write your
own version of loadLibSVMFile: a loader function very similar to the existing
one with a few characters
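The core of such a loader is one line-parsing function; a variant input format would need only small changes here. A minimal sketch (illustrative, not MLlib's loader):

```python
# LIBSVM-style lines look like "<label> <index>:<value> <index>:<value> ..."
# with 1-based feature indices. Parsing one line into a label plus a
# sparse feature map is the part a custom loader would tweak.

def parse_libsvm_line(line):
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, value = item.split(":")
        features[int(idx) - 1] = float(value)  # convert to 0-based
    return label, features

label, features = parse_libsvm_line("1.0 1:0.5 3:2.0")
print(label, features)  # 1.0 {0: 0.5, 2: 2.0}
```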
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass
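The one-pass idea can be sketched as follows: evaluate many candidate split thresholds for a feature in a single scan by accumulating per-threshold label counts, instead of rescanning the data once per candidate. (A toy stand-in, much simpler than MLlib's DecisionTree.)

```python
# One scan over (value, binary_label) pairs updates the statistics for
# every candidate threshold at once; the tradeoff is holding counts for
# all candidates in memory, mirroring the increased model state noted
# above.

def split_counts(values_labels, thresholds):
    # counts[t] = [positives_left, total_left] for split "value <= thresholds[t]"
    counts = [[0, 0] for _ in thresholds]
    for v, label in values_labels:          # single pass over the data
        for t, thr in enumerate(thresholds):
            if v <= thr:
                counts[t][0] += label
                counts[t][1] += 1
    return counts

data = [(0.1, 1), (0.4, 1), (0.6, 0), (0.9, 0)]
print(split_counts(data, [0.5, 0.75]))  # [[2, 2], [2, 3]]
```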
Hi there,
Generally we try to avoid duplicating logic if possible, particularly for
algorithms that share a great deal of algorithmic similarity. See, for
example, the way we implement logistic regression vs. linear regression vs.
linear SVM with different gradient functions all on top of SGD or
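That shared-driver pattern can be sketched as one SGD loop parameterized by a gradient function (names are illustrative, not MLlib's API):

```python
import random

# One SGD driver shared across models; each model supplies only its
# gradient function, so the optimization logic is written once.

def sgd(data, gradient, w0, step=0.1, iters=100, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(iters):
        x, y = rng.choice(data)                      # sample one example
        g = gradient(w, x, y)
        w = [wi - step * gi for wi, gi in zip(w, g)]  # gradient step
    return w

def squared_loss_grad(w, x, y):        # linear regression
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def hinge_grad(w, x, y):               # linear SVM (y in {-1, +1})
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return [0.0] * len(w) if margin >= 1 else [-y * xi for xi in x]

# Fit y = 2x + 1 on noiseless data with the squared-loss gradient.
data = [([1.0, xv], 2.0 * xv + 1.0) for xv in [0.0, 1.0, 2.0, 3.0]]
w = sgd(data, squared_loss_grad, [0.0, 0.0], step=0.05, iters=2000)
```

Swapping `squared_loss_grad` for `hinge_grad` (or a logistic gradient) changes the model without touching the driver, which is the duplication-avoidance being described.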
While DBSCAN and others would be welcome contributions, I couldn't agree
more with Sean.
On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen so...@cloudera.com wrote:
Nobody asked me, and this is a comment on a broader question, not this
one, but:
In light of a number of recent items about adding