Re: PySpark on PyPi

2015-07-24 Thread Jeremy Freeman
Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library. You might want to check out this (https://github.com/minrk/findspark), started by Jupyter project devs, that

Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

2015-04-08 Thread Jeremy Freeman
+1 for this feature. In our use case, we probably wouldn’t use this feature in production, but it can be useful during prototyping and algorithm development to repeatedly perform the same streaming operation on a fixed, already existing set of files. - jeremyfreeman.net

Re: Google Summer of Code - ideas

2015-02-26 Thread Jeremy Freeman
For topic #4 (streaming ML in Python), there’s an existing JIRA, but progress seems to have stalled. I’d be happy to help if you want to pick it up! https://issues.apache.org/jira/browse/SPARK-4127 - jeremyfreeman.net @thefreemanlab On Feb 26, 2015, at 4:20 PM, Xiangrui

Re: Adding third party jars to classpath used by pyspark

2014-12-29 Thread Jeremy Freeman
Hi Stephen, it should be enough to include --jars /path/to/file.jar in the command line call to either pyspark or spark-submit, as in: spark-submit --master local --jars /path/to/file.jar myfile.py and you can check the bottom of the Web UI’s “Environment” tab to make sure the jar gets on
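The call above can be assembled programmatically when several jars are involved. The following is only a sketch: the helper name build_spark_submit is hypothetical and not part of Spark, and it only builds the argument list without launching anything. It relies on --jars accepting a single comma-separated value.

```python
import shlex
from typing import List, Sequence


def build_spark_submit(script: str, jars: Sequence[str], master: str = "local") -> List[str]:
    """Return the argv list for a spark-submit call that ships extra jars.

    Hypothetical helper for illustration; it does not run spark-submit.
    """
    cmd = ["spark-submit", "--master", master]
    if jars:
        # --jars takes one comma-separated value rather than repeated flags.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    argv = build_spark_submit("myfile.py", ["/path/to/file.jar"])
    # Print the command in a form that can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The returned list can be passed to subprocess.run if launching from Python is preferred over typing the command by hand.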

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-02 Thread Jeremy Freeman
+1 (non-binding). Installed version pre-built for Hadoop on a private HPC; ran PySpark shell w/ IPython; loaded data using custom Hadoop input formats; ran MLlib routines in PySpark; ran custom workflows in PySpark; browsed the web UI. Noticeable improvements in stability and performance during large

Re: Python3 and spark 1.1.0

2014-11-06 Thread Jeremy Freeman
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There does seem to be interest, see also this post (http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html). I believe Ariel Rokem (cced) has been trying to get it to work and might be working

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Jeremy Freeman
Great idea! +1 — Jeremy - jeremyfreeman.net @thefreemanlab On Nov 5, 2014, at 11:48 PM, Timothy Chen tnac...@gmail.com wrote: Matei that makes sense, +1 (non-binding) Tim On Wed, Nov 5, 2014 at 8:46 PM, Cheng Lian lian.cs@gmail.com wrote: +1 since this is

Re: Building and Running Spark on OS X

2014-10-20 Thread Jeremy Freeman
I also prefer sbt on Mac. You might want to add checking for / getting Python 2.6+ (though most modern Macs should have it), and maybe numpy as an optional dependency. I often just point people to Anaconda. — Jeremy - jeremyfreeman.net @thefreemanlab On Oct 20, 2014,

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Jeremy Freeman
. -- Jeremy - jeremy freeman, phd neuroscientist @thefreemanlab On Sep 5, 2014, at 12:23 PM, Patrick Wendell pwend...@gmail.com wrote: Hey There, I believe this is on the roadmap for the 1.2 next release. But Xiangrui can comment on this. - Patrick On Fri, Sep 5

RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Jeremy Freeman
+1 -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Jeremy Freeman
+1. Validated several custom analysis pipelines on a private cluster in standalone mode. Tested new PySpark support for arbitrary Hadoop input formats, works great! -- Jeremy

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman
Hey RJ, Sorry for the delay, I'd be happy to take a look at this if you can post the code! I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other. -- Jeremy - jeremy freeman, phd neuroscientist
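One way to make the split rule an option, as suggested above, is to pass the selection strategy by name. This is a minimal sketch in plain Python under stated assumptions: the strategy names "largest" and "highest_sse" and all function names are hypothetical, and this is not RJ's code or an MLlib API.

```python
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sse(points: List[Point]) -> float:
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points)


# Hypothetical strategy names; each maps a cluster to a score, highest wins.
SPLIT_STRATEGIES: Dict[str, Callable[[List[Point]], float]] = {
    "largest": len,       # split the cluster with the most points
    "highest_sse": sse,   # split the cluster with the most within-cluster spread
}


def choose_cluster_to_split(clusters: List[List[Point]], strategy: str = "largest") -> int:
    """Return the index of the cluster to bisect in the next divisive round."""
    score = SPLIT_STRATEGIES[strategy]
    return max(range(len(clusters)), key=lambda i: score(clusters[i]))
```

With a tight four-point cluster and a widely spread two-point cluster, "largest" picks the former and "highest_sse" picks the latter, which is the kind of difference the option would expose.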

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Jeremy Freeman
@Ignacio, happy to share, here's a link to a library we've been developing (https://github.com/freeman-lab/thunder). As just a couple examples, we have pipelines that use Fourier transforms and other signal processing from scipy, and others that do massively parallel model fitting via Scikit

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations, or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems at first glance that some of the

Re: Re:How to run specific sparkSQL test with maven

2014-08-01 Thread Jeremy Freeman
With Maven you can run a particular test suite like this: mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test (see the note under Spark Tests in Maven: http://spark.apache.org/docs/latest/building-with-maven.html)

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-19 Thread Jeremy Freeman
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together. -- Jeremy -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7418.html Sent from the Apache Spark

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, Cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361). For divisive,