[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390782#comment-14390782
 ] 

Evan Sparks commented on SPARK-6646:


Guys - you're clearly ignoring prior work. The database community solved this 
problem 20 years ago with the Gubba project - a mature prototype [can be seen 
here|http://i.imgur.com/FJK7K9x.jpg]. 

Additionally, everyone knows that joins don't scale on iOS, and you'll never be 
able to build indexes on this platform.


 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2015-03-02 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344291#comment-14344291
 ] 

Evan Sparks commented on SPARK-3530:


We have looked at integrating Caffe with Spark. Probably the most logical 
place to do this (as a proof of concept) is via PySpark, since Caffe has Python 
bindings. If you're going to use a pre-trained model in Caffe, this should 
parallelize well and may well suit your needs. Training the pipeline in 
parallel is trickier, because the communication requirements for training these 
types of networks via mini-batch SGD are very high. 

If you're in the scenario where you want to instantiate a pre-trained Caffe 
network on all the workers (and only do it once), then a broadcast variable is 
probably a good way to go, and I'd expect it to fit nicely into the 
pipelines framework.
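
For the broadcast approach, here is a minimal sketch, assuming hypothetical 
loadModel/score helpers standing in for the actual Caffe bindings (they are 
illustrative, not a real API):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the Caffe bindings -- illustrative only.
def loadModel(weights: Array[Byte]): AnyRef = ???
def score(net: AnyRef, image: Array[Float]): Int = ???

def classify(sc: SparkContext, weights: Array[Byte],
             images: RDD[Array[Float]]): RDD[Int] = {
  // Ship the serialized weights to each executor exactly once.
  val bc = sc.broadcast(weights)
  images.mapPartitions { iter =>
    // Instantiate the network once per partition, then reuse it per record.
    val net = loadModel(bc.value)
    iter.map(image => score(net, image))
  }
}
{code}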

 Pipeline and Parameters
 -----------------------

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.2.0


 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries

2015-02-09 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313398#comment-14313398
 ] 

Evan Sparks commented on SPARK-5705:


This JIRA is a continuation of this thread: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-td10481.html

To summarise - high-speed linear algebra operations, including but not limited 
to matrix multiplies and solves, have the potential to make certain machine 
learning operations faster on Spark. However, we have to be careful to balance 
the overhead of copying data to, and calling out to, the GPU against other 
factors in the design of the system.

Additionally - getting these libraries compiled, linked, built, and configured 
on a target system is unfortunately not trivial. We should make sure we have a 
standard process for doing this (perhaps starting with this codebase: 
http://github.com/shivaram/matrix-bench).

Maybe we should start with some applications where we think GPU acceleration 
could help? Neural nets are one, LDA is another - others?
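
To make the copy-versus-compute tradeoff concrete, here is a minimal CPU-side 
timing sketch using Breeze (my example, not taken from matrix-bench); a GPU 
backend would additionally pay O(n^2) data movement for the host-to-device and 
device-to-host copies against O(n^3) compute:

{code}
import breeze.linalg.DenseMatrix

// Time an n x n double-precision matrix multiply on the CPU.
// A GPU library would add host->device and device->host copies,
// which need to be measured separately from the multiply itself.
def timeGemm(n: Int): Double = {
  val a = DenseMatrix.rand[Double](n, n)
  val b = DenseMatrix.rand[Double](n, n)
  val start = System.nanoTime()
  val c = a * b
  val seconds = (System.nanoTime() - start) / 1e9
  require(c.rows == n) // keep c live so the multiply is not elided
  seconds
}

Seq(256, 1024, 4096).foreach(n => println(s"n=$n: ${timeGemm(n)}s"))
{code}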

 Explore GPU-accelerated Linear Algebra Libraries
 

 Key: SPARK-5705
 URL: https://issues.apache.org/jira/browse/SPARK-5705
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Evan Sparks
Priority: Minor








[jira] [Created] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries

2015-02-09 Thread Evan Sparks (JIRA)
Evan Sparks created SPARK-5705:
--

 Summary: Explore GPU-accelerated Linear Algebra Libraries
 Key: SPARK-5705
 URL: https://issues.apache.org/jira/browse/SPARK-5705
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Evan Sparks
Priority: Minor









[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development

2014-11-24 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223864#comment-14223864
 ] 

Evan Sparks commented on SPARK-4565:


[~pwendell] suggested that we add this to the Spark tuning guide and 
programming guide. What do you think?

 Add docs about advanced spark application development
 -----------------------------------------------------

 Key: SPARK-4565
 URL: https://issues.apache.org/jira/browse/SPARK-4565
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Evan Sparks
Priority: Minor

 [~shivaram], [~jegonzal] and I have been working on a brief document based on 
 our experiences writing high-performance Spark applications - MLlib, GraphX, 
 pipelines, ml-matrix, etc.
 It currently exists here - 
 https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
 Would it make sense to add these tips and tricks to the Spark Wiki?






[jira] [Created] (SPARK-4565) Add docs about advanced spark application development

2014-11-23 Thread Evan Sparks (JIRA)
Evan Sparks created SPARK-4565:
--

 Summary: Add docs about advanced spark application development
 Key: SPARK-4565
 URL: https://issues.apache.org/jira/browse/SPARK-4565
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Evan Sparks
Priority: Minor


[~shivaram], [~jegonzal] and I have been working on a brief document based on 
our experiences writing high-performance Spark applications - MLlib, GraphX, 
pipelines, ml-matrix, etc.

It currently exists here - 
https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing

Would it make sense to add these tips and tricks to the Spark Wiki?






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222105#comment-14222105
 ] 

Evan Sparks commented on SPARK-1405:


[~gq] - Those are great numbers for a very high number of topics - it's a 
little tough to follow what's leading to the super-linear scaling in #topics in 
your code, though. Are you using FastLDA or something similar to speed up 
sampling? (http://www.ics.uci.edu/~newman/pubs/fastlda.pdf)

Pedro has been testing on a Wikipedia dump on S3 which I provided. It's 
XML-formatted, one document per line, so it's easy to parse. I will copy this 
to a requester-pays bucket (which will be free if you run your experiments on 
EC2) now, so that everyone working on this can use it for testing.

The NIPS dataset seems fine for small-scale testing, but I think it's important 
that we test this implementation across a range of values for documents, words, 
topics, and tokens - hence, I think the data generator that Pedro is working on 
is a really good idea (and it follows the convention of the existing data 
generators in MLlib). We'll have to be a little careful here, because some of 
the methods for making LDA fast rely on the fact that it tends to converge 
quickly, and I expect that data generated by the model will be much easier to 
fit than real data.

Also, can we try to be consistent in our terminology - it is easy to confuse 
the number of unique words in a corpus with the total number of words. I 
propose "words" for the former and "tokens" for the latter.
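
A tiny sketch of the distinction, on a toy whitespace-tokenized corpus:

{code}
// "words" = unique vocabulary entries; "tokens" = total occurrences.
val docs = sc.parallelize(Seq("the cat sat", "the dog sat"))
val tokenized = docs.map(_.split(" ").toSeq)

val tokens = tokenized.map(_.size).sum().toLong              // 6 tokens
val words  = tokenized.flatMap(identity).distinct().count()  // 4 words
{code}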

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -----------------------------------------------------------------

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from 
 Lucene), and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222112#comment-14222112
 ] 

Evan Sparks commented on SPARK-1405:


Bucket has been created: 
s3://files.sparks.requester.pays/enwiki_category_text/ - all in all, there are 
181 files of roughly 50MB each (closer to 10GB in total). 

It probably makes sense to use http://sweble.org/ or something similar to strip 
the boilerplate, etc. from the documents for the purposes of topic modeling.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -----------------------------------------------------------------

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from 
 Lucene), and a Gibbs sampling core.






[jira] [Commented] (SPARK-3573) Dataset

2014-10-29 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919
 ] 

Evan Sparks commented on SPARK-3573:


This comment originally appeared on the PR associated with this feature. 
(https://github.com/apache/spark/pull/2919):

I've looked at the code here, and it basically seems reasonable. One high-level 
concern I have is around the programming pattern that this encourages: complex 
nesting of otherwise simple structures, which may make it difficult to program 
against Datasets in sufficiently complicated applications.

A 'dataset' is now a collection of Row, where we have the guarantee that all 
rows in a Dataset conform to the same schema. A schema is a list of (name, 
type) pairs which describe the attributes available in the dataset. This seems 
like a good thing to me, and is pretty much what we described in MLI (and how 
conventional databases have been structured forever). So far, so good.

The concern that I have is that we are now encouraging these attributes to be 
complex types. For example, where I might have had
{code}
val x = Schema(('a', classOf[String]), ('b', classOf[Double]), ..., ('z', classOf[Double]))
{code}
this would become
{code}
val x = Schema(('a', classOf[String]), ('bGroup', classOf[Vector]), ..., ('zGroup', classOf[Vector]))
{code}

So, great, my schema now has these vector things in it, which I can create 
separately, pass around, etc.

This clearly has its merits:
1) Features are grouped together logically, based on the process that creates 
them.
2) Managing one short schema where each record is comprised of a few large 
objects (say, 4 vectors, each of length 1000) is probably easier than managing 
a really big schema comprised of lots of small objects (say, 4000 doubles).

But there are some major drawbacks:
1) Why stop at only one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA, deal with these Datasets? Is 
there an implicit conversion that flattens these things to RDD[LabeledPoint]? 
Do we want to guarantee these semantics?
3) Manipulating and subsetting nested schemas like this might be tricky. Where 
before I might be able to write:
{code}
val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))
{code}
now I might have to write:
{code}
val groupSelections = Seq(Seq(0,1,2,4), Seq(0,1), Seq(0,1,2))
val x: Dataset = groupSelections.zip(input.columns).map { case (gs, col) => col(gs) }
{code}

Ignoring the raw syntax and semantics of how you might actually map an 
operation over the columns of a Dataset and get back a well-structured dataset, 
I think this makes two conflicting points:
1) In the first example - presumably all the work goes into figuring out which 
subset of features you want in this really wide feature space.
2) In the second example - there's a lot of gymnastics that goes into 
subsetting feature groups. I think it's clear that working with lots of feature 
groups might get unwieldy pretty quickly.

If we look at R or pandas/scikit-learn as examples of projects that have 
(arguably quite successfully) dealt with these interface issues, there is one 
basic pattern: learning algorithms expect big tables of numbers as input. Even 
here, there are some important differences:

For example, in scikit-learn, categorical features aren't supported directly by 
most learning algorithms. Instead, users are responsible for getting data from 
a "table with heterogeneously typed columns" to a "table of numbers" with 
something like OneHotEncoder and other feature transformers. In R, on the other 
hand, categorical features are (sometimes frustratingly) first-class citizens 
by virtue of the "factor" data type - which is essentially an enum. Most 
out-of-the-box learning algorithms (like glm()) accept data frames with 
categorical inputs and handle them sensibly - implicitly one-hot encoding the 
categorical features (or creating dummy variables, if you prefer).
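
As a concrete illustration of that hand-off, here is a hypothetical hand-rolled 
one-hot encoding in Spark (an illustrative sketch, not an existing MLlib API):

{code}
// Map each distinct category to an index, then expand the categorical
// column into a 0/1 indicator vector so learners see only numbers.
case class Record(label: Double, country: String)

val data = sc.parallelize(Seq(
  Record(1.0, "US"), Record(0.0, "FR"), Record(1.0, "US")))

val index = data.map(_.country).distinct().collect().sorted.zipWithIndex.toMap
val encoded = data.map { r =>
  val v = Array.fill(index.size)(0.0)
  v(index(r.country)) = 1.0
  (r.label, v)
}
{code}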

While I have a slight preference for representing things as big flat tables, I 
would be fine coding either way - but I wanted to raise the issue for 
discussion here before the interfaces are set in stone.

 Dataset
 ---

 Key: SPARK-3573
 URL: https://issues.apache.org/jira/browse/SPARK-3573
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
 ML-specific metadata embedded in its schema.
 Sample code:
 Suppose we have training events stored on HDFS and user/ad features in Hive, 
 and we want to assemble features for training and then apply a decision tree.
 The proposed pipeline with dataset looks like the following (needs more 
 refinement):
 {code}
 sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
 

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-10-02 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156699#comment-14156699
 ] 

Evan Sparks commented on SPARK-1405:


Hi Guoqiang - is it correct that your runtimes are reported in minutes as 
opposed to seconds? In your tests, have you cached the input data? 45 minutes 
for 150 iterations over this small dataset seems slow to me. It would be great 
to get an idea of where the bottleneck is coming from. Is it the Gibbs step or 
something else?

Is it possible to share the dataset you used for these experiments?

Thanks!

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -----------------------------------------------------------------

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from 
 Lucene), and a Gibbs sampling core.






[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120432#comment-14120432
 ] 

Evan Sparks commented on SPARK-3384:


I agree with Sean. Avoiding the cost of object allocation overhead is 
important here. As far as I can tell, we are using reduceByKey in the 
prescribed way (see: 
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E) 
- mutating the left input. I don't believe that Spark needs this mutation to be 
thread-safe, because it executes the combine sequentially on all workers, and 
then reduces sequentially on the master, but I could be wrong.
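
For reference, a minimal sketch of the prescribed pattern (illustrative, not 
the actual KMeans code):

{code}
import breeze.linalg.DenseVector

// reduceByKey with a function that mutates and returns its left argument.
// In-place += avoids allocating a fresh vector per record; each combiner
// applies the function sequentially, so no thread safety is required.
val points = sc.parallelize(Seq(
  (0, DenseVector(1.0, 2.0)),
  (0, DenseVector(3.0, 4.0)),
  (1, DenseVector(5.0, 6.0))))

val sums = points.reduceByKey { (a, b) => a += b; a }
{code}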

 Potential thread unsafe Breeze vector addition in KMeans
 

 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling

 In the KMeans clustering implementation, the Breeze vectors are accumulated 
 using +=.  For example,
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
  This is potentially a thread unsafe operation.  (This is what I observed in 
 local testing.)  I suggest changing the += to + -- a new object will be 
 allocated but it will be thread safe since it won't write to an old location 
 accessed by multiple threads.
 Further testing is required to reproduce and verify.






[jira] [Comment Edited] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120432#comment-14120432
 ] 

Evan Sparks edited comment on SPARK-3384 at 9/3/14 8:54 PM:


I agree with Sean. Avoiding the cost of object allocation overhead is 
important here. As far as I can tell, we are using reduceByKey in the 
prescribed way (see: 
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E) 
- mutating the left input. I don't believe that Spark needs this mutation to be 
thread-safe, because it executes the combine sequentially on all workers, and 
then reduces sequentially on the master, but I could be wrong.


was (Author: sparks):
I agree with Sean. Avoiding the costly penalty of object allocation overhead is 
important to avoid here. As far as I can tell, we are using reduceByKey in the 
prescribed way (see: 
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E)
 mutating the left input. I don't believe that spark needs this mutation to be 
thread-safe, because it executes the combine sequentially on all workers, and 
then reduces sequentially on the master, but I could be wrong.

 Potential thread unsafe Breeze vector addition in KMeans
 

 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling

 In the KMeans clustering implementation, the Breeze vectors are accumulated 
 using +=.  For example,
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
  This is potentially a thread unsafe operation.  (This is what I observed in 
 local testing.)  I suggest changing the += to + -- a new object will be 
 allocated but it will be thread safe since it won't write to an old location 
 accessed by multiple threads.
 Further testing is required to reproduce and verify.


