Re: PySpark on PyPi
Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library. You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this; I've also cc'ed them here to join the conversation. Also, @Jey, I can confirm that at least in some scenarios (I've done it in an EC2 cluster in standalone mode) it's possible to run PySpark jobs just using `from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration / functionality that would require going through the proper submit scripts.

On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal punya.bis...@gmail.com wrote:

I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas. Punya

On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote:

// + Davies for his comments
// + Punya for SA

For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies. Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.

I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately. Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?

On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot o.girar...@lateral-thoughts.com wrote:

Ok, I get it.
Now what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to:

1- download a pre-built version of pyspark and unzip it somewhere on every agent
2- define the SPARK_HOME env
3- symlink this distribution's pyspark dir inside the Python install's site-packages/ directory

and if I rely on additional packages (like databricks' Spark-CSV project), I have to (unless I'm mistaken):

4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults.conf file

Then finally we can launch our unit/integration tests. Some issues are related to spark-packages, some to the lack of Python-based dependency management, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions (I didn't check), and considering spark-shell downloads such dependencies automatically, I think that if nothing's done yet, it will be (I guess?). For step 3, maybe just adding a setup.py to the distribution would be enough; I'm not exactly advocating distributing a full 300 MB Spark distribution on PyPI, so maybe there's a better compromise?

Regards,
Olivier.

On Fri, Jun 5, 2015 at 22:12, Jey Kottalam j...@cs.berkeley.edu wrote:

Couldn't we have a pip installable pyspark package that just serves as a shim to an existing Spark
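For illustration, here is a minimal sketch of the findspark route mentioned above, assuming SPARK_HOME points at an existing Spark distribution; the master URL and app name are placeholders, not from the thread:

import findspark
findspark.init()  # adds pyspark and the bundled py4j zip to sys.path

from pyspark import SparkContext

# Placeholder master URL; in the EC2 standalone scenario described above,
# PYTHONPATH and PYSPARK_PYTHON must also be set correctly on the workers.
sc = SparkContext(master="spark://master-host:7077", appName="findspark-sketch")
print(sc.parallelize(range(100)).sum())
sc.stop()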
Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?
+1 for this feature. In our use case we probably wouldn't use it in production, but it can be useful during prototyping and algorithm development to repeatedly perform the same streaming operation on a fixed, already-existing set of files.

- jeremyfreeman.net @thefreemanlab

On Apr 8, 2015, at 2:51 PM, Emre Sevinc emre.sev...@gmail.com wrote:

Tathagata, Thanks for stating your preference for Approach 2. My use case and motivation are similar to the concerns raised by others in SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series, Spark Streaming applications could process the files in an input directory that existed before the streaming application began, and for some projects that we did for our customers, we relied on that feature. Starting from the 1.2.x series, we are limited in this respect to files whose timestamp is not older than 1 minute, and the only workaround is to 'touch' those files before starting a streaming application. Moreover, MIN_REMEMBER_DURATION is set to an arbitrary value of 1 minute, and I don't see any argument why it cannot be set to another arbitrary value (keeping the default of 1 minute if nothing is set by the user).

Putting all this together, my plan is to create a pull request that:

1- Converts private val MIN_REMEMBER_DURATION into private val minRememberDuration (to reflect that it is no longer a constant, in the sense that it can be set via configuration)
2- Sets its value using something like getConf("spark.streaming.minRememberDuration", Minutes(1))
3- Documents spark.streaming.minRememberDuration in the Spark Streaming Programming Guide

If the above sounds fine, then I'll go on implementing this small change and submit a pull request for fixing SPARK-3276. What do you say?

Kind regards, Emre Sevinç
http://www.bigindustries.be/

On Wed, Apr 8, 2015 at 7:16 PM, Tathagata Das t...@databricks.com wrote:

Approach 2 is definitely better :) Can you tell us more about the use case, i.e. why you want to do this? TD

On Wed, Apr 8, 2015 at 1:44 AM, Emre Sevinc emre.sev...@gmail.com wrote:

Hello, This is about SPARK-3276: I want to make MIN_REMEMBER_DURATION (which is now a constant) a variable (configurable, with a default value). Before spending effort on developing something and creating a pull request, I wanted to consult the core developers to see which approach makes the most sense and has the highest probability of being accepted. The constant MIN_REMEMBER_DURATION can be seen at https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338, where it is marked as a private member of the private[streaming] object FileInputDStream.

Approach 1: Make MIN_REMEMBER_DURATION a variable, with a new name of minRememberDuration, and then add a new fileStream method to JavaStreamingContext.scala (https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala) such that the new fileStream method accepts a new parameter, e.g. minRememberDuration: Int (in seconds), and then use this value to set the private minRememberDuration.

Approach 2: Create a new, public Spark configuration property, e.g. named spark.rememberDuration.min (with a default value of 60 seconds), and then set the private variable minRememberDuration to the value of this Spark property.

Approach 1 would mean adding a new method to the public API; Approach 2 would mean creating a new public Spark property.
Right now, approach 2 seems more straightforward and simpler to me, but nevertheless I wanted to have the opinions of other developers who know the internals of Spark better than I do. Kind regards, Emre Sevinç -- Emre Sevinc
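To make the two approaches concrete, here is a rough sketch of what Approach 2 would look like from a user's point of view, assuming the property name spark.streaming.minRememberDuration proposed in the plan above (at the time of this thread the property did not exist yet; the path and durations are placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("reprocess-existing-files")
        # proposed property: remember files up to an hour old instead of one minute
        .set("spark.streaming.minRememberDuration", "3600s"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)

# With a larger remember duration, the stream keeps track of (and can process)
# files whose timestamps are older than the previous one-minute cutoff.
lines = ssc.textFileStream("hdfs:///data/incoming")
lines.count().pprint()

ssc.start()
ssc.awaitTermination()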
Re: Google Summer of Code - ideas
For topic #4 (streaming ML in Python), there's an existing JIRA, but progress seems to have stalled. I'd be happy to help if you want to pick it up! https://issues.apache.org/jira/browse/SPARK-4127

- jeremyfreeman.net @thefreemanlab

On Feb 26, 2015, at 4:20 PM, Xiangrui Meng men...@gmail.com wrote:

There are a couple of things in the Scala/Java API but missing in the Python API:
1. model import/export
2. evaluation metrics
3. distributed linear algebra
4. streaming algorithms
If you are interested, we can list/create target JIRAs and hunt them down one by one. Best, Xiangrui

On Wed, Feb 25, 2015 at 7:37 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote:

Hi, I think that would be really good. Are there any specific issues that are to be implemented as per priority?
Re: Adding third party jars to classpath used by pyspark
Hi Stephen, it should be enough to include --jars /path/to/file.jar in the command-line call to either pyspark or spark-submit, as in:

spark-submit --master local --jars /path/to/file.jar myfile.py

and you can check the bottom of the Web UI's "Environment" tab to make sure the jar gets on your classpath. Let me know if you still see errors related to this.

— Jeremy

- jeremyfreeman.net @thefreemanlab

On Dec 29, 2014, at 7:55 PM, Stephen Boesch java...@gmail.com wrote:

What is the recommended way to do this? We have some native database client libraries for which we are adding pyspark bindings. The pyspark script invokes spark-submit. Do we add our libraries to SPARK_SUBMIT_LIBRARY_PATH? This issue relates back to an error we have been seeing, "Py4jError: Trying to call a package", the suspicion being that the third-party libraries may not be available on the JVM side.
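As a quick illustration of the failure mode mentioned above, here is a hedged sketch of driver code that assumes the jar was shipped with --jars as described; com.example.db.NativeClient is a hypothetical class name, not a real library:

from pyspark import SparkContext

sc = SparkContext(appName="third-party-jar-check")

# If the jar is on the classpath, instantiating the class through the Py4J
# gateway works; if it is missing, this is where the
# "Py4jError: Trying to call a package" error shows up.
client = sc._jvm.com.example.db.NativeClient()
print(client.toString())

sc.stop()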
Re: [VOTE] Release Apache Spark 1.2.0 (RC1)
+1 (non-binding)

- Installed version pre-built for Hadoop on a private HPC
- ran PySpark shell w/ IPython
- loaded data using custom Hadoop input formats
- ran MLlib routines in PySpark
- ran custom workflows in PySpark
- browsed the web UI

Noticeable improvements in stability and performance during large shuffles (as well as the elimination of frequent but unpredictable "FileNotFound / too many open files" errors). We initially hit errors during large collects that ran fine in 1.1, but setting the new spark.driver.maxResultSize to 0 preserved the old behavior (a short sketch of this setting follows the thread). Definitely worth highlighting this setting in the release notes, as the new default may be too small for some users and workloads.

— Jeremy

- jeremyfreeman.net @thefreemanlab

On Dec 2, 2014, at 3:22 AM, Denny Lee denny.g@gmail.com wrote:

+1 (non-binding)

- Verified on OSX 10.10.2, built from source
- spark-shell / spark-submit jobs ran various simple Spark / Scala queries
- ran various SparkSQL queries (including HiveContext)
- ran ThriftServer service and connected via beeline
- ran SparkSVD

On Mon Dec 01 2014 at 11:09:26 PM Patrick Wendell pwend...@gmail.com wrote:

Hey All, Just an update. Josh, Andrew, and others are working to reproduce SPARK-4498 and fix it. Other than that issue, no serious regressions have been reported so far. If we are able to get a fix in for that soon, we'll likely cut another RC with the patch. Continued testing of RC1 is definitely appreciated! I'll leave this vote open to allow folks to continue posting comments. It's fine to still give +1 from your own testing, i.e. you can assume at this point that SPARK-4498 will be fixed before releasing. - Patrick

On Mon, Dec 1, 2014 at 3:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

+0.9 from me. Tested it on Mac and Windows (someone has to do it) and while things work, I noticed a few recent scripts don't have Windows equivalents, namely https://issues.apache.org/jira/browse/SPARK-4683 and https://issues.apache.org/jira/browse/SPARK-4684. The first one at least would be good to fix if we do another RC. Not blocking the release, but useful to fix in the docs is https://issues.apache.org/jira/browse/SPARK-4685. Matei

On Dec 1, 2014, at 11:18 AM, Josh Rosen rosenvi...@gmail.com wrote:

Hi everyone, There's an open bug report related to Spark standalone which could be a potential release-blocker (pending investigation / a bug fix): https://issues.apache.org/jira/browse/SPARK-4498. This issue seems non-deterministic and only affects long-running Spark standalone deployments, so it may be hard to reproduce. I'm going to work on a patch to add additional logging in order to help with debugging. I just wanted to give an early heads-up about this issue and to get more eyes on it in case anyone else has run into it or wants to help with debugging. - Josh

On November 28, 2014 at 9:18:09 PM, Patrick Wendell (pwend...@gmail.com) wrote:

Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb

The release files, including signatures, digests, etc.
can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1048/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.0! The vote is open until Tuesday, December 02, at 05:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.1.X, minor regressions, or bugs related to new features will not block this release.

== What default changes should I be aware of? ==
1. The default value of spark.shuffle.blockTransferService has been changed to netty -- Old behavior can be restored by switching to nio.
2. The default value of spark.shuffle.manager has been changed to sort -- Old behavior can be restored by setting spark.shuffle.manager to hash.

== Other notes ==
Because this vote is occurring over a weekend, I will likely extend the vote if this RC survives until the end of the vote period.

- Patrick
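For reference, the spark.driver.maxResultSize workaround mentioned in the +1 report above looks roughly like this in a PySpark driver (a sketch; the app name and data are placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("large-collect")
        # 0 disables the limit and restores the pre-1.2 behavior; the new default is 1g
        .set("spark.driver.maxResultSize", "0"))
sc = SparkContext(conf=conf)

# A large collect that could otherwise trip the new driver-side limit.
data = sc.parallelize(range(10000000)).collect()
print(len(data))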
Re: Python3 and spark 1.1.0
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There does seem to be interest; see also this post: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html. I believe Ariel Rokem (cc'ed) has been trying to get it working and might be working on a PR. It would probably be good to create a JIRA ticket for this.

— Jeremy

- jeremyfreeman.net @thefreemanlab

On Nov 6, 2014, at 6:01 PM, catchmonster skacan...@gmail.com wrote:

Hi, I am interested in py3 with Spark! Simply everything that I am developing in py is happening on the py3 side. Is there a plan to integrate Spark 1.1.0 or above with py3? It seems that is not supported in the current latest version.
Re: [VOTE] Designating maintainers for some Spark components
Great idea! +1

— Jeremy

- jeremyfreeman.net @thefreemanlab

On Nov 5, 2014, at 11:48 PM, Timothy Chen tnac...@gmail.com wrote:

Matei, that makes sense, +1 (non-binding) Tim

On Wed, Nov 5, 2014 at 8:46 PM, Cheng Lian lian.cs@gmail.com wrote:

+1, since this is already the de facto model we are using.

On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote:

+1

Sent from my iPhone

On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote:

+1, great idea.

On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:

+1 (binding)

On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com wrote:

+1 (binding)

On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

+1 on this proposal.

On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

Will these maintainers do a cleanup of the pending PRs once we start to apply this model?

I second Nan's question. I would like to see this initiative drive a reduction in the number of stale PRs we have out there. We're approaching 300 open PRs again. Nick
Re: Building and Running Spark on OS X
I also prefer sbt on Mac. You might want to add checking for / getting Python 2.6+ (though most modern Macs should have it), and maybe NumPy as an optional dependency. I often just point people to Anaconda.

— Jeremy

- jeremyfreeman.net @thefreemanlab

On Oct 20, 2014, at 8:28 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

So back to my original question... :) If we wanted to post this guide to the user list or to a gist for easy reference, would we rather have Maven or SBT listed? And is there anything else about the steps that should be modified? Nick

On Mon, Oct 20, 2014 at 8:25 PM, Sean Owen so...@cloudera.com wrote:

Oh right, we're talking about the bundled sbt of course. And I didn't know Maven wasn't installed anymore!

On Mon, Oct 20, 2014 at 8:20 PM, Hari Shreedharan hshreedha...@cloudera.com wrote:

The sbt executable that is in the Spark repo can be used to build Spark without any other setup (it will download the sbt jars etc.). Thanks, Hari
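A small sketch of the kind of environment check suggested above, as it might appear in a setup guide (the exact messages are illustrative):

import sys

# PySpark at this point requires Python 2.6+.
if sys.version_info < (2, 6):
    sys.exit("PySpark requires Python 2.6 or newer; found %s" % sys.version.split()[0])

# NumPy is optional but needed for MLlib's Python API.
try:
    import numpy
    print("numpy %s found" % numpy.__version__)
except ImportError:
    print("numpy not found; install it (e.g. via Anaconda) to use MLlib from Python")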
Re: [mllib] Add multiplying large scale matrices
Hey all, definitely agreed this would be nice! In our own work we've done element-wise addition, subtraction, and scalar multiplication of similarly partitioned matrices very efficiently with zipping. We've also done matrix-matrix multiplication with zipping, but that only works in certain circumstances, and it's otherwise very communication-intensive (as Shivaram says). Another tricky thing with addition / subtraction is how to handle sparse vs. dense arrays. We'd be happy to contribute anything we did, but it's definitely worth first knowing what progress has been made in the AMPLab.

-- Jeremy

- jeremy freeman, phd
neuroscientist
@thefreemanlab

On Sep 5, 2014, at 12:23 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey there, I believe this is on the roadmap for the next release, 1.2, but Xiangrui can comment on this. - Patrick

On Fri, Sep 5, 2014 at 9:18 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote:

Hi Evan, that sounds interesting. Here is the ticket I created: https://issues.apache.org/jira/browse/SPARK-3416

thanks,
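To make the zipping approach concrete, here is a rough PySpark sketch of element-wise addition of two similarly partitioned matrices, assuming each matrix is stored as an RDD of NumPy row vectors with identical partitioning (which is what lets zip line the rows up without a shuffle):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="matrix-add-sketch")

n_rows, n_cols, n_parts = 1000, 50, 10
a = sc.parallelize([np.random.randn(n_cols) for _ in range(n_rows)], n_parts)
b = sc.parallelize([np.random.randn(n_cols) for _ in range(n_rows)], n_parts)

# Element-wise addition: one local NumPy add per zipped pair of rows, no shuffle.
c = a.zip(b).map(lambda rows: rows[0] + rows[1])

print(c.first()[:5])
sc.stop()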
RE: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
+1. Validated several custom analysis pipelines on a private cluster in standalone mode. Tested new PySpark support for arbitrary Hadoop input formats, works great! -- Jeremy
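For context, the arbitrary Hadoop input format support being tested here is exposed in PySpark roughly like this (a sketch using the stock text input format as a stand-in; the path is a placeholder, and real use cases would supply a custom InputFormat and, if needed, a key/value Converter):

from pyspark import SparkContext

sc = SparkContext(appName="hadoop-input-format-sketch")

rdd = sc.newAPIHadoopFile(
    "hdfs:///data/part-*",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")

print(rdd.take(2))
sc.stop()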
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hey RJ, sorry for the delay, I'd be happy to take a look at this if you can post the code! I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other.

-- Jeremy

- jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote:

Hi all, I wanted to follow up. I have a prototype for an optimized version of hierarchical k-means and wanted to get some feedback on my approach. Jeremy's implementation splits the largest cluster in each round. Is it better to do it that way or to split each cluster in half? Are there any open-source examples that are being widely used in production? Thanks!

On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:

Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- currently I'm trying to get my PR for MiniBatch KMeans accepted. If it's cool with you, I'll try converting your code to fit in with the existing MLlib code as you suggest. I also need to review the Decision Tree code (as suggested above) to see how much of that can be reused. Maybe I can ask you to do a code review for me when I'm done?

On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:

Hi all, cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361).

For divisive, hierarchical clustering I implemented something a while back; here's a gist: https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for keeping track of the tree. I also found this much better than agglomerative methods (for the reasons Hector points out). This needs to be cleaned up, and can surely be optimized (esp. by replacing the core KMeans step with existing MLlib code), but I can say I was running it successfully on quite large data sets. RJ, depending on where you are in your progress, I'd be happy to help work on this piece and / or have you use this as a jumping-off point, if useful.

-- Jeremy

--
em rnowl...@gmail.com
c 954.496.2314
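To make the design question above concrete, here is a rough, purely local NumPy sketch of the "split the largest cluster each round" policy (this is not the gist's implementation; the naive 2-means step and all names are illustrative):

import numpy as np

def two_means(points, n_iter=10):
    # Naive k-means with k=2, enough to illustrate one bisection step.
    centers = points[np.random.choice(len(points), 2, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def bisecting_kmeans(points, n_clusters):
    clusters = [points]
    while len(clusters) < n_clusters:
        # Policy under discussion: always bisect the largest remaining cluster.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        if len(clusters[i]) < 2:
            break
        target = clusters.pop(i)
        labels = two_means(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

parts = bisecting_kmeans(np.random.randn(500, 3), 4)
print([len(p) for p in parts])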
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
@Ignacio, happy to share; here's a link to a library we've been developing: https://github.com/freeman-lab/thunder. As just a couple of examples, we have pipelines that use Fourier transforms and other signal processing from SciPy, and others that do massively parallel model fitting via scikit-learn functions, etc. That should give you some idea of how such libraries can be usefully integrated into a PySpark project. Btw, a couple of things we do overlap with functionality now available in MLlib via the Python API, which we're working on integrating.

On Aug 13, 2014, at 5:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote:

Yep, I thought it was a bogus comparison. I should rephrase my question, as it was poorly phrased: on average, how much faster is Spark vs. PySpark (I didn't really mean Scala vs. Python)? I've only used Spark and don't have a chance to test this at the moment, so if anybody has these numbers or general estimates (10x, etc.), that'd be great. @Jeremy, if you can discuss this, what's an example of a project you implemented using these libraries + PySpark? Thanks everyone!

On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

On a related note, I recently heard about Distributed R (https://github.com/vertica/DistributedR), which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale. It would be interesting to see some kind of comparison between that and MLlib (and perhaps also SparkR, https://github.com/amplab-extras/SparkR-pkg), especially since Distributed R has a concept of distributed arrays and works on data in-memory. Docs are here: https://github.com/vertica/DistributedR/tree/master/doc/platform Nick

On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:

They only compared their own implementations of a couple of algorithms on different platforms, rather than comparing the different platforms themselves (in the case of Spark, PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt that if you implement an ML algorithm in Python itself without any native libraries, the performance will be sub-optimal.

What PySpark really provides is:
- Using Spark transformations in Python
- ML algorithms implemented in Scala (leveraging native numerical libraries for high performance), callable from Python

The paper claims Python is now one of the most popular languages for ML-oriented programming, and that's why they went ahead with Python. However, as I understand it, very few people actually implement algorithms in Python directly because of the sub-optimal performance. Most people implement algorithms in other languages (e.g. C / Java), and expose APIs in Python for ease of use. This is what we are trying to do with PySpark as well.

On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote:

Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all? As with any and all benchmarks, I'm sure there are caveats, but it'd be nice to have a response to the question above for starters. Thanks, Ignacio
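To give a flavor of the pattern described above (single-machine SciPy / scikit-learn routines driven from PySpark), here is a hedged sketch; the data layout (one independent signal per record) is purely illustrative:

import numpy as np
from scipy.fftpack import fft
from sklearn.linear_model import LinearRegression
from pyspark import SparkContext

sc = SparkContext(appName="python-libs-in-pyspark")

# e.g. thousands of independent signals, one (key, time series) pair per record
signals = sc.parallelize([(i, np.random.randn(256)) for i in range(1000)])

# Signal processing with SciPy, applied in parallel across the cluster.
spectra = signals.mapValues(lambda ts: np.abs(fft(ts))[:128])

# Massively parallel model fitting with scikit-learn: one small regression per record.
x = np.arange(256).reshape(-1, 1)
slopes = signals.mapValues(lambda ts: LinearRegression().fit(x, ts).coef_[0])

print(spectra.first()[1][:3], slopes.first())
sc.stop()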
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations, or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems on first glance that some of the implementations in the paper themselves might not have been optimal (regardless of Python vs Scala). All that said, we have found it useful to implement some workflows purely in Python, mainly when we want to exploit libraries like NumPy, SciPy, or Scikit Learn, or incorporate existing Python code bases, in which case the flexibility is worth a drop in performance, at least for us! This might also make more sense for specialized routines as opposed to core, low-level algorithms.
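As a counterpart to the note above about Scala versions exposed to Python, here is a minimal sketch of the "thin Python call site, Scala/JVM implementation underneath" pattern using MLlib's clustering API (the data is synthetic and illustrative):

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="scala-backed-from-python")

# Two well-separated blobs of 2-D points.
points = sc.parallelize([np.random.randn(2) + 5 * (i % 2) for i in range(10000)])

# The heavy lifting runs in Scala; only the call site is Python.
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)
sc.stop()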
Re: Re:How to run specific sparkSQL test with maven
With Maven you can run a particular test suite like this:

mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test

See the note here (under "Spark Tests in Maven"): http://spark.apache.org/docs/latest/building-with-maven.html
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together. -- Jeremy
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all, Cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361).

For divisive, hierarchical clustering I implemented something awhile back, here's a gist: https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for keeping track of the tree. I also found this much better than agglomerative methods (for the reasons Hector points out). This needs to be cleaned up, and can surely be optimized (esp. by replacing the core KMeans step with existing MLLib code), but I can say I was running it successfully on quite large data sets. RJ, depending on where you are in your progress, I'd be happy to help work on this piece and / or have you use this as a jumping off point, if useful.

-- Jeremy