[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390782#comment-14390782 ]

Evan Sparks commented on SPARK-6646:

Guys - you're clearly ignoring prior work. The database community solved this problem 20 years ago with the Gubba project - a mature prototype [can be seen here|http://i.imgur.com/FJK7K9x.jpg]. Additionally, everyone knows that joins don't scale on iOS, and you'll never be able to build indexes on this platform.

Spark 2.0: Rearchitecting Spark for Mobile Platforms
----------------------------------------------------
Key: SPARK-6646
URL: https://issues.apache.org/jira/browse/SPARK-6646
Project: Spark
Issue Type: Improvement
Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
Attachments: Spark on Mobile - Design Doc - v1.pdf

Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark's project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share our findings with you. This ticket outlines the technical design of Spark's mobile support and shares results from several early prototypes.

Mobile-friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344291#comment-14344291 ]

Evan Sparks commented on SPARK-3530:

We have looked at integrating Caffe with Spark - probably the most logical place to do this (as a proof of concept) is via PySpark, since Caffe has Python bindings. If you're going to use a pre-trained model in Caffe, this should parallelize well and might well suit your needs. Training the pipeline in parallel is trickier, because the communication requirements for training these types of networks via mini-batch SGD are very high.

If you're in the scenario where you want to instantiate a pre-trained Caffe network on all the workers (and only do it once), then a broadcast variable is probably a good way to go, and I'd expect that it fits nicely into the pipelines framework.

Pipeline and Parameters
-----------------------
Key: SPARK-3530
URL: https://issues.apache.org/jira/browse/SPARK-3530
Project: Spark
Issue Type: Sub-task
Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
Fix For: 1.2.0

This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks!
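[Editor's note] A minimal sketch of the broadcast-variable pattern described in the comment above, in Scala for concreteness (a real Caffe deployment would go through pycaffe in PySpark). The Model class is a hypothetical stand-in for a pre-trained network; the point is only the shape: load once on the driver, broadcast, score on the workers.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a pre-trained network (e.g. a Caffe model).
class Model(weights: Array[Double]) extends Serializable {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}

object BroadcastModelExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-model"))
    val model = new Model(Array(0.5, -1.2, 3.0)) // loaded once, on the driver
    val bcModel = sc.broadcast(model)            // shipped to each worker once

    val data = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)))
    // Workers read the broadcast copy; the model is not reserialized per record.
    val predictions = data.map(x => bcModel.value.predict(x))
    predictions.collect().foreach(println)
    sc.stop()
  }
}
{code}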
[jira] [Commented] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries
[ https://issues.apache.org/jira/browse/SPARK-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313398#comment-14313398 ]

Evan Sparks commented on SPARK-5705:

This JIRA is a continuation of this thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-td10481.html

To summarise: high-speed linear algebra operations - including, but not limited to, matrix multiplies and solves - have the potential to make certain machine learning operations faster on Spark. However, we've got to be careful to balance the overhead of copying data to, and calling out to, the GPU against other factors in the design of the system. Additionally, getting these libraries compiled, linked, built, and configured on a target system is unfortunately not trivial. We should make sure we have a standard process for doing this (perhaps starting with this codebase: http://github.com/shivaram/matrix-bench).

Maybe we should start with some applications where we think GPU acceleration could help? Neural nets are one, LDA is another - others?

Explore GPU-accelerated Linear Algebra Libraries
------------------------------------------------
Key: SPARK-5705
URL: https://issues.apache.org/jira/browse/SPARK-5705
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: Evan Sparks
Priority: Minor
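[Editor's note] To make the copy-vs-compute tradeoff concrete, here is a rough single-node timing harness in the spirit of the matrix-bench repo linked above - a sketch only, with arbitrary sizes. Breeze routes dense multiplies to whatever BLAS netlib-java is configured with, so the same loop can compare a reference Java BLAS, an optimized CPU BLAS, and a GPU-backed one.

{code}
import breeze.linalg.DenseMatrix

object MatrixBench {
  // Time one n x n dense multiply; Breeze dispatches this to the
  // configured BLAS backend via netlib-java.
  def timeMultiply(n: Int): Double = {
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)
    val start = System.nanoTime()
    val c = a * b
    (System.nanoTime() - start) / 1e9
  }

  def main(args: Array[String]): Unit = {
    // Small sizes should favor the CPU (transfer overhead dominates);
    // large sizes are where a GPU BLAS could plausibly win.
    for (n <- Seq(256, 512, 1024, 2048)) {
      println(f"n=$n%5d: ${timeMultiply(n)}%.3f s")
    }
  }
}
{code}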
[jira] [Created] (SPARK-5705) Explore GPU-accelerated Linear Algebra Libraries
Evan Sparks created SPARK-5705:

Summary: Explore GPU-accelerated Linear Algebra Libraries
Key: SPARK-5705
URL: https://issues.apache.org/jira/browse/SPARK-5705
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: Evan Sparks
Priority: Minor
[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development
[ https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223864#comment-14223864 ]

Evan Sparks commented on SPARK-4565:

[~pwendell] suggested that we add this to the Spark tuning guide and programming guide. What do you think?

Add docs about advanced spark application development
------------------------------------------------------
Key: SPARK-4565
URL: https://issues.apache.org/jira/browse/SPARK-4565
Project: Spark
Issue Type: Improvement
Components: Documentation
Reporter: Evan Sparks
Priority: Minor

[~shivaram], [~jegonzal] and I have been working on a brief document based on our experiences writing high-performance Spark applications - MLlib, GraphX, pipelines, ml-matrix, etc. It currently lives here: https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing

Would it make sense to add these tips and tricks to the Spark Wiki?
[jira] [Created] (SPARK-4565) Add docs about advanced spark application development
Evan Sparks created SPARK-4565:

Summary: Add docs about advanced spark application development
Key: SPARK-4565
URL: https://issues.apache.org/jira/browse/SPARK-4565
Project: Spark
Issue Type: Improvement
Components: Documentation
Reporter: Evan Sparks
Priority: Minor

[~shivaram], [~jegonzal] and I have been working on a brief document based on our experiences writing high-performance Spark applications - MLlib, GraphX, pipelines, ml-matrix, etc. It currently lives here: https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing

Would it make sense to add these tips and tricks to the Spark Wiki?
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222105#comment-14222105 ]

Evan Sparks commented on SPARK-1405:

[~gq] - Those are great numbers for a very high number of topics - it's a little tough to follow what's leading to the super-linear scaling in #topics in your code, though. Are you using FastLDA or something similar to speed up sampling? (http://www.ics.uci.edu/~newman/pubs/fastlda.pdf)

Pedro has been testing on a Wikipedia dump on S3 which I provided. It's XML-formatted, one document per line, so it's easy to parse. I will copy this to a requester-pays bucket (which will be free if you run your experiments on EC2) now, so that everyone working on this can use it for testing.

The NIPS dataset seems fine for small-scale testing, but I think it's important that we test this implementation across a range of values for documents, words, topics, and tokens - hence, I think the data generator that Pedro is working on is a really good idea (and it follows the convention of the existing data generators in MLlib). We'll have to be a little careful here, because some of the methods for making LDA fast rely on the fact that it tends to converge quickly, and I expect that data generated by the model will be much easier to fit than real data.

Also, can we try to be consistent in our terminology - it's easy to confuse the number of unique words with all the words in a corpus. I propose "words" for the former and "tokens" for the latter.

parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
-----------------------------------------------------------------
Key: SPARK-1405
URL: https://issues.apache.org/jira/browse/SPARK-1405
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
Labels: features
Attachments: performance_comparison.png
Original Estimate: 336h
Remaining Estimate: 336h

Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses inference algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
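[Editor's note] To make the proposed terminology concrete, a tiny Spark snippet (assuming an existing SparkContext named sc; the corpus is made up):

{code}
// "tokens" = every word occurrence in the corpus;
// "words"  = distinct vocabulary entries.
val docs = sc.parallelize(Seq("the cat sat", "the dog sat"))
val tokens = docs.flatMap(_.split(" "))
println(tokens.count())            // 6 tokens
println(tokens.distinct().count()) // 4 words: the, cat, sat, dog
{code}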
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222112#comment-14222112 ]

Evan Sparks commented on SPARK-1405:

Bucket has been created: s3://files.sparks.requester.pays/enwiki_category_text/

All in all, there are 181 files of ~50 MB each (closer to 10 GB in total). It probably makes sense to use http://sweble.org/ or something similar to strip the boilerplate, etc. from the documents for the purposes of topic modeling.

parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
-----------------------------------------------------------------
Key: SPARK-1405
URL: https://issues.apache.org/jira/browse/SPARK-1405
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
Labels: features
Attachments: performance_comparison.png
Original Estimate: 336h
Remaining Estimate: 336h

Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses inference algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
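[Editor's note] Since the dump is one XML document per line, a first-cut parser is just textFile plus a per-line XML parse. A sketch, assuming an existing SparkContext named sc, and ignoring requester-pays credential setup and the exact element names in the dump:

{code}
import scala.util.Try
import scala.xml.XML

// Each line is a standalone XML document; parse defensively and keep
// only the plain text for topic modeling.
val lines = sc.textFile("s3n://files.sparks.requester.pays/enwiki_category_text/")
val docTexts = lines.flatMap { line =>
  Try(XML.loadString(line)).toOption.map(_.text) // skip malformed lines
}
{code}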
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919 ]

Evan Sparks commented on SPARK-3573:

This comment originally appeared on the PR associated with this feature (https://github.com/apache/spark/pull/2919):

I've looked at the code here, and it basically seems reasonable. One high-level concern I have is around the programming pattern that this encourages: complex nesting of otherwise simple structure that may make it difficult to program against Datasets for sufficiently complicated applications.

A 'dataset' is now a collection of Row, where we have the guarantee that all rows in a Dataset conform to the same schema. A schema is a list of (name, type) pairs which describe the attributes available in the dataset. This seems like a good thing to me, and is pretty much what we described in MLI (and how conventional databases have been structured forever). So far, so good.

The concern I have is that we are now encouraging these attributes to be complex types. For example, where I might have had

{code}
val x = Schema(("a", classOf[String]), ("b", classOf[Double]), ..., ("z", classOf[Double]))
{code}

this would become

{code}
val x = Schema(("a", classOf[String]), ("bGroup", classOf[Vector]), ..., ("zGroup", classOf[Vector]))
{code}

So, great, my schema now has these vector things in it, which I can create separately, pass around, etc. This clearly has its merits:
1) Features are grouped together logically based on the process that creates them.
2) Managing one short schema where each record is comprised of a few large objects (say, 4 vectors, each of length 1000) is probably easier than managing a really big schema comprised of lots of small objects (say, 4000 doubles).

But there are some major drawbacks:
1) Why stop at one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA, deal with these Datasets? Is there an implicit conversion that flattens these things to RDD[LabeledPoint]? Do we want to guarantee these semantics?
3) Manipulating and subsetting nested schemas like this might be tricky. Where before I might be able to write:

{code}
val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))
{code}

now I might have to write

{code}
val groupSelections = Seq(Seq(0,1,2,4), Seq(0,1), Seq(0,1,2))
val x: Dataset = groupSelections.zip(input.columns).map { case (gs, col) => col(gs) }
{code}

Ignoring the raw syntax and semantics of how you might actually map an operation over the columns of a Dataset and get back a well-structured dataset, I think this makes two conflicting points:
1) In the first example, presumably all the work goes into figuring out what the subset of features you want is in this really wide feature space.
2) In the second example, there's a lot of gymnastics that goes into subsetting feature groups.

I think it's clear that working with lots of feature groups might get unreasonable pretty quickly.

If we look at R or pandas/scikit-learn as examples of projects that have (arguably quite successfully) dealt with these interface issues, there is one basic pattern: learning algorithms expect big tables of numbers as input. Even here, there are some important differences. For example, in scikit-learn, categorical features aren't supported directly by most learning algorithms. Instead, users are responsible for getting data from "table with heterogeneously typed columns" to "table of numbers" with something like OneHotEncoder and other feature transformers.
In R, on the other hand, categorical features are (sometimes frustratingly) first-class citizens by virtue of the "factor" data type - which is essentially an enum. Most out-of-the-box learning algorithms (like glm()) accept data frames with categorical inputs and handle them sensibly - implicitly one-hot encoding (or creating dummy variables, if you prefer) the categorical features.

While I have a slight preference for representing things as big flat tables, I would be fine coding either way - but I wanted to raise the issue for discussion here before the interfaces are set in stone.

Dataset
-------
Key: SPARK-3573
URL: https://issues.apache.org/jira/browse/SPARK-3573
Project: Spark
Issue Type: Sub-task
Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.

Sample code: suppose we have training events stored on HDFS and user/ad features in Hive, and we want to assemble features for training and then apply a decision tree. The proposed pipeline with dataset looks like the following (needs more refinement):

{code}
sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
{code}
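[Editor's note] On the flattening question raised in the comment above (drawback 2): whatever the Dataset API settles on, the conversion a learning algorithm would need is mechanical. A hypothetical sketch, assuming the nested representation is just a label plus a sequence of MLlib Vectors:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Flatten (label, feature groups) into the flat numeric form that
// algorithms like SVM or PCA expect. Group boundaries are lost here,
// which is exactly the trade-off discussed above.
def flatten(label: Double, groups: Seq[Vector]): LabeledPoint =
  LabeledPoint(label, Vectors.dense(groups.flatMap(_.toArray).toArray))
{code}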
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156699#comment-14156699 ]

Evan Sparks commented on SPARK-1405:

Hi Guoqiang - is it correct that your runtimes are reported in minutes, as opposed to seconds? In your tests, have you cached the input data? 45 minutes for 150 iterations over this small dataset seems slow to me. It would be great to get an idea of where the bottleneck is coming from - is it the Gibbs step or something else? Is it possible to share the dataset you used for these experiments? Thanks!

parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
-----------------------------------------------------------------
Key: SPARK-1405
URL: https://issues.apache.org/jira/browse/SPARK-1405
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Labels: features
Attachments: performance_comparison.png
Original Estimate: 336h
Remaining Estimate: 336h

Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses inference algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
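[Editor's note] The caching question matters because Gibbs sampling is iterative. A sketch (paths and the sweep body are placeholders; sc is an existing SparkContext):

{code}
// Without cache(), each of the 150 iterations re-reads the corpus from disk.
val corpus = sc.textFile("hdfs:///path/to/corpus").cache()
corpus.count() // materialize up front, so iteration times reflect compute only
for (iter <- 1 to 150) {
  // ... one Gibbs sampling sweep over `corpus` ...
}
{code}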
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120432#comment-14120432 ]

Evan Sparks commented on SPARK-3384:

I agree with Sean. Avoiding the overhead of object allocation is important here. As far as I can tell, we are using reduceByKey in the prescribed way (see: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E), mutating the left input. I don't believe that Spark needs this mutation to be thread-safe, because it executes the combine sequentially on all workers, and then reduces sequentially on the master - but I could be wrong.

Potential thread unsafe Breeze vector addition in KMeans
--------------------------------------------------------
Key: SPARK-3384
URL: https://issues.apache.org/jira/browse/SPARK-3384
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: RJ Nowling

In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

This is potentially a thread-unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated, but it will be thread-safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify.
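[Editor's note] For illustration, the pattern in question looks roughly like this - a standalone sketch, not the actual KMeans code (sc is an existing SparkContext):

{code}
import breeze.linalg.DenseVector

// reduceByKey may mutate and return its left argument: within a task the
// combine runs sequentially, so in-place += avoids allocating a fresh
// vector per record without needing thread safety.
val assignments = sc.parallelize(Seq(
  (0, DenseVector(1.0, 2.0)),
  (0, DenseVector(3.0, 4.0)),
  (1, DenseVector(5.0, 6.0))))
val sums = assignments.reduceByKey { (left, right) =>
  left += right // in-place accumulation into the left operand
  left          // return the mutated left vector
}
{code}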