Re: PySpark on PyPi

2015-07-24 Thread Jeremy Freeman
Hey all, great discussion, just wanted to +1 that I see a lot of value in steps 
that make it easier to use PySpark as an ordinary Python library.

You might want to check out findspark (https://github.com/minrk/findspark), 
started by the Jupyter project devs, which offers one way to facilitate this. 
I’ve also cc'ed them here to join the conversation.

Also, @Jey, I can confirm that at least in some scenarios (I’ve done it on an 
EC2 cluster in standalone mode) it’s possible to run PySpark jobs just using 
`from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as 
the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly 
on *both* the workers and the driver. That said, there’s definitely additional 
configuration / functionality that would require going through the proper 
submit scripts.
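
For example, a minimal sketch of that pattern (the master URL is just a placeholder; findspark.init() locates a Spark install via SPARK_HOME and prepends its pyspark and py4j directories to sys.path):

    # findspark locates an existing Spark install (via SPARK_HOME or an
    # explicit path) and puts pyspark / py4j on sys.path.
    import findspark
    findspark.init()  # or findspark.init("/path/to/spark")

    from pyspark import SparkContext

    # Placeholder standalone master URL; PYSPARK_PYTHON should point at the
    # same Python on the driver and the workers.
    sc = SparkContext(master="spark://master-host:7077", appName="sanity-check")
    print(sc.parallelize(range(100)).sum())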

 On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal punya.bis...@gmail.com 
 wrote:
 
 I agree with everything Justin just said. An additional advantage of 
 publishing PySpark's Python code in a standards-compliant way is the fact 
 that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way 
 that pip can use. Contrast this with the current situation, where 
 df.toPandas() exists in the Spark API but doesn't actually work until you 
 install Pandas.
 
 Punya
 On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote:
 // + Davies for his comments
 // + Punya for SA
 
 For development and CI, like Olivier mentioned, I think it would be hugely 
 beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If 
 anyone wants to develop against the PySpark APIs, they need to download the 
 distribution and do a lot of PYTHONPATH munging for all the tools (pylint, 
 pytest, IDE code completion). Right now that involves adding python/ and 
 python/lib/py4j-0.8.2.1-src.zip. If pyspark ever adds more 
 dependencies, we would have to manually mirror all the PYTHONPATH munging in 
 the ./pyspark script. With a proper pyspark setup.py that declares its 
 dependencies, and a published distribution, depending on pyspark would just be 
 a matter of adding pyspark to my setup.py dependencies.
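
 For concreteness, a sketch of what that setup.py dependency would look like, assuming a published pyspark distribution exists on PyPI (the project name, version, and exact pin here are hypothetical):

    # setup.py for a downstream project, assuming pyspark is published on PyPI
    from setuptools import setup, find_packages

    setup(
        name="my-spark-project",          # hypothetical project name
        version="0.1.0",
        packages=find_packages(),
        install_requires=[
            # pin exactly to the cluster's Spark version to avoid mismatches
            "pyspark==1.4.1",
        ],
    )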
 
 Of course, if we actually want to run the parts of pyspark that are backed by 
 Py4J calls, then we need the full Spark distribution with either ./pyspark or 
 ./spark-submit, but for things like linting and development, the PYTHONPATH 
 munging is very annoying.
 
 I don't think the version-mismatch issues are a compelling reason not to go 
 ahead with PyPI publishing. At runtime, we should definitely enforce that the 
 version has to match exactly, which means there is no backcompat nightmare as 
 suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This 
 would mean that even if the user's pip-installed pyspark somehow got loaded 
 before the pyspark provided by the Spark distribution, the user would be 
 alerted immediately.
 
 Davies, if you buy this, should I or someone on my team pick up 
 https://issues.apache.org/jira/browse/SPARK-1267 and 
 https://github.com/apache/spark/pull/464?
 
 On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:
 Ok, I get it. Now what can we do to improve the current situation? Because 
 right now, if I want to set up a CI env for PySpark, I have to:
 1- download a pre-built Spark distribution and unzip it somewhere on every 
 agent
 2- define the SPARK_HOME env variable
 3- symlink the distribution's pyspark dir into the Python install's 
 site-packages/ directory
 and if I rely on additional packages (like Databricks' spark-csv project), I 
 have to (unless I'm mistaken):
 4- compile/assemble spark-csv and deploy the jar to a specific directory on 
 every agent
 5- add this jar-filled directory to the Spark distribution's additional 
 classpath using the conf/spark-defaults.conf file
 
 Then finally we can launch our unit/integration tests. 
 Some issues are related to spark-packages, some to the lack of Python-based 
 dependency management, and some to the way SparkContexts are launched when using pyspark.
 I think steps 1 and 2 are fair enough.
 Steps 4 and 5 may already have solutions; I didn't check, and considering 
 spark-shell downloads such dependencies automatically, I expect that if 
 nothing handles this yet, something will (I guess?).
 
 For step 3, maybe just adding a setup.py to the distribution would be enough; 
 I'm not exactly advocating distributing a full 300 MB Spark distribution on 
 PyPI, so maybe there's a better compromise?
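
 As a sketch, one way to avoid the symlink in step 3 entirely is a pytest conftest.py that puts the distribution's Python sources on sys.path, assuming SPARK_HOME from step 2 and the standard layout of a pre-built distribution:

    # conftest.py -- make the downloaded distribution's pyspark importable in
    # CI without symlinking it into site-packages/.
    import glob
    import os
    import sys

    spark_home = os.environ["SPARK_HOME"]  # set in step 2
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # the bundled py4j zip, e.g. python/lib/py4j-0.8.2.1-src.zip
    sys.path.insert(0, glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])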
 
 Regards, 
 
 Olivier.
 
 On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam j...@cs.berkeley.edu wrote:
 Couldn't we have a pip installable pyspark package that just serves as a 
 shim to an existing Spark 

Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

2015-04-08 Thread Jeremy Freeman
+1 for this feature

In our use case, we probably wouldn’t use this feature in production, but it 
can be useful during prototyping and algorithm development to repeatedly 
perform the same streaming operation on a fixed, already existing set of files.
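
As a sketch of how we'd use it from PySpark, assuming Approach 2 below is adopted with the proposed property name and a duration-style value (the directory, batch interval, and value here are all hypothetical):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # Raise the remember duration so files already sitting in the input
    # directory (older than the default 1 minute) are still picked up.
    conf = (SparkConf()
            .setAppName("replay-existing-files")
            .set("spark.streaming.minRememberDuration", "3600s"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)

    lines = ssc.textFileStream("hdfs:///data/landing")
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()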

-
jeremyfreeman.net
@thefreemanlab

On Apr 8, 2015, at 2:51 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 Tathagata,
 
 Thanks for stating your preference for Approach 2.
 
 My use case and motivation are similar to the concerns raised by others in
 SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series, Spark
 Streaming applications could process the files in an input directory that
 existed before the streaming application began, and for some projects that
 we did for our customers, we relied on that feature. Starting with the
 1.2.x series, we are limited in this respect to files whose time stamp is
 no older than 1 minute. The only workaround is to 'touch' those files
 before starting a streaming application.
 
 Moreover, this MIN_REMEMBER_DURATION is set to an arbitrary value of 1
 minute, and I don't see any argument why it cannot be set to another
 arbitrary value (keeping the default value of 1 minute, if nothing is set
 by the user).
 
 Putting all this together, my plan is to create a Pull Request that is like
 
  1- Convert private val MIN_REMEMBER_DURATION into private val
 minRememberDuration (to reflect the change that it is not a constant in
 the sense that it can be set via configuration)
 
  2- Set its value by using something like
 getConf("spark.streaming.minRememberDuration", Minutes(1))
 
  3- Document the spark.streaming.minRememberDuration in Spark Streaming
 Programming Guide
 
 If the above sounds fine, then I'll go on implementing this small change
 and submit a pull request for fixing SPARK-3276.
 
 What do you say?
 
 Kind regards,
 
 Emre Sevinç
 http://www.bigindustries.be/
 
 
 On Wed, Apr 8, 2015 at 7:16 PM, Tathagata Das t...@databricks.com wrote:
 
 Approach 2 is definitely better  :)
 Can you tell us more about the use case why you want to do this?
 
 TD
 
 On Wed, Apr 8, 2015 at 1:44 AM, Emre Sevinc emre.sev...@gmail.com wrote:
 
 Hello,
 
 This is about SPARK-3276 and I want to make MIN_REMEMBER_DURATION (that is
 now a constant) a variable (configurable, with a default value). Before
 spending effort on developing something and creating a pull request, I
 wanted to consult with the core developers to see which approach makes
 most
 sense, and has the higher probability of being accepted.
 
 The constant MIN_REMEMBER_DURATION can be seen at:
 
 
 
 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338
 
 it is marked as a private member of the private[streaming] object
 FileInputDStream.
 
 Approach 1: Make MIN_REMEMBER_DURATION a variable, with the new name
 minRememberDuration, and then add a new fileStream method to
 JavaStreamingContext.scala :
 
 
 
 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
 
 such that the new fileStream method accepts a new parameter, e.g.
 minRememberDuration: Int (in seconds), and then uses this value to set the
 private minRememberDuration.
 
 
 Approach 2: Create a new, public Spark configuration property, e.g. named
 spark.rememberDuration.min (with a default value of 60 seconds), and then
 set the private variable minRememberDuration to the value of this Spark
 property.
 
 
 Approach 1 would mean adding a new method to the public API, Approach 2
 would mean creating a new public Spark property. Right now, approach 2
 seems more straightforward and simpler to me, but nevertheless I wanted to
 have the opinions of other developers who know the internals of Spark
 better than I do.
 
 Kind regards,
 Emre Sevinç
 
 
 
 
 
 -- 
 Emre Sevinc



Re: Google Summer of Code - ideas

2015-02-26 Thread Jeremy Freeman
For topic #4 (streaming ML in Python), there’s an existing JIRA, but progress 
seems to have stalled. I’d be happy to help if you want to pick it up!

https://issues.apache.org/jira/browse/SPARK-4127

-
jeremyfreeman.net
@thefreemanlab

On Feb 26, 2015, at 4:20 PM, Xiangrui Meng men...@gmail.com wrote:

 There are a couple of things available in the Scala/Java API but missing from the Python API:
 
 1. model import/export
 2. evaluation metrics
 3. distributed linear algebra
 4. streaming algorithms
 
 If you are interested, we can list/create target JIRAs and hunt them
 down one by one.
 
 Best,
 Xiangrui
 
 On Wed, Feb 25, 2015 at 7:37 PM, Manoj Kumar
 manojkumarsivaraj...@gmail.com wrote:
 Hi,
 
 I think that would be really good. Are there any specific issues that should
 be implemented first, in order of priority?
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 



Re: Adding third party jars to classpath used by pyspark

2014-12-29 Thread Jeremy Freeman
Hi Stephen, it should be enough to include 

 --jars /path/to/file.jar

in the command line call to either pyspark or spark-submit, as in

 spark-submit --master local --jars /path/to/file.jar myfile.py

and you can check the bottom of the Web UI's "Environment" tab to make sure the 
jar gets on your classpath. Let me know if you still see errors related to this.
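
If you'd rather not pass it on the command line every time, a sketch of the programmatic equivalent (the jar path is a placeholder; spark.jars takes a comma-separated list of jars to ship with the application):

    from pyspark import SparkConf, SparkContext

    # Programmatic equivalent of --jars; set before the SparkContext is created
    conf = SparkConf().set("spark.jars", "/path/to/file.jar")
    sc = SparkContext(master="local", conf=conf)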

— Jeremy

-
jeremyfreeman.net
@thefreemanlab

On Dec 29, 2014, at 7:55 PM, Stephen Boesch java...@gmail.com wrote:

 What is the recommended way to do this? We have some native database
 client libraries for which we are adding pyspark bindings.
 
 The pyspark script invokes spark-submit. Do we add our libraries to
 SPARK_SUBMIT_LIBRARY_PATH?
 
 This issue relates back to an error we have been seeing, "Py4JError: Trying
 to call a package" - the suspicion being that the third-party libraries may
 not be available on the JVM side.



Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-02 Thread Jeremy Freeman
+1 (non-binding)

Installed the version pre-built for Hadoop on a private HPC cluster
ran the PySpark shell with IPython
loaded data using custom Hadoop input formats
ran MLlib routines in PySpark
ran custom workflows in PySpark
browsed the web UI

Noticeable improvements in stability and performance during large shuffles (as 
well as the elimination of frequent but unpredictable "FileNotFound / too many 
open files" errors).

We initially hit errors during large collects that ran fine in 1.1, but setting 
the new spark.driver.maxResultSize to 0 preserved the old behavior. Definitely 
worth highlighting this setting in the release notes, as the new default may be 
too small for some users and workloads.
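
For reference, a minimal sketch of the workaround (0 disables the driver-side result size limit):

    from pyspark import SparkConf, SparkContext

    # 0 turns the driver-side result size limit off, restoring pre-1.2 behavior
    conf = SparkConf().set("spark.driver.maxResultSize", "0")
    sc = SparkContext(conf=conf)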

— Jeremy

-
jeremyfreeman.net
@thefreemanlab

On Dec 2, 2014, at 3:22 AM, Denny Lee denny.g@gmail.com wrote:

 +1 (non-binding)
 
 Verified on OSX 10.10.2, built from source,
 spark-shell / spark-submit jobs
 ran various simple Spark / Scala queries
 ran various SparkSQL queries (including HiveContext)
 ran ThriftServer service and connected via beeline
 ran SparkSVD
 
 
 On Mon Dec 01 2014 at 11:09:26 PM Patrick Wendell pwend...@gmail.com
 wrote:
 
 Hey All,
 
 Just an update. Josh, Andrew, and others are working to reproduce
 SPARK-4498 and fix it. Other than that issue no serious regressions
 have been reported so far. If we are able to get a fix in for that
 soon, we'll likely cut another RC with the patch.
 
 Continued testing of RC1 is definitely appreciated!
 
 I'll leave this vote open to allow folks to continue posting comments.
 It's fine to still give +1 from your own testing... i.e. you can
 assume at this point SPARK-4498 will be fixed before releasing.
 
 - Patrick
 
 On Mon, Dec 1, 2014 at 3:30 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 +0.9 from me. Tested it on Mac and Windows (someone has to do it) and
 while things work, I noticed a few recent scripts don't have Windows
 equivalents, namely https://issues.apache.org/jira/browse/SPARK-4683 and
 https://issues.apache.org/jira/browse/SPARK-4684. The first one at least
 would be good to fix if we do another RC. Not blocking the release but
 useful to fix in docs is https://issues.apache.org/jira/browse/SPARK-4685.
 
 Matei
 
 
 On Dec 1, 2014, at 11:18 AM, Josh Rosen rosenvi...@gmail.com wrote:
 
 Hi everyone,
 
 There's an open bug report related to Spark standalone which could be a
 potential release-blocker (pending investigation / a bug fix):
 https://issues.apache.org/jira/browse/SPARK-4498.  This issue seems
 non-deterministic and only affects long-running Spark standalone
 deployments, so it may be hard to reproduce.  I'm going to work on a patch
 to add additional logging in order to help with debugging.
 
 I just wanted to give an early heads-up about this issue and to get
 more eyes on it in case anyone else has run into it or wants to help with
 debugging.
 
 - Josh
 
 On November 28, 2014 at 9:18:09 PM, Patrick Wendell (pwend...@gmail.com)
 wrote:
 
 Please vote on releasing the following candidate as Apache Spark
 version 1.2.0!
 
 The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc1/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1048/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
 
 Please vote on releasing this package as Apache Spark 1.2.0!
 
 The vote is open until Tuesday, December 02, at 05:15 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.2.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == What justifies a -1 vote for this release? ==
 This vote is happening very late into the QA period compared with
 previous votes, so -1 votes should only occur for significant
 regressions from 1.0.2. Bugs already present in 1.1.X, minor
 regressions, or bugs related to new features will not block this
 release.
 
 == What default changes should I be aware of? ==
 1. The default value of spark.shuffle.blockTransferService has been
 changed to "netty"
 -- Old behavior can be restored by switching to "nio"
 
 2. The default value of spark.shuffle.manager has been changed to
 "sort".
 -- Old behavior can be restored by setting spark.shuffle.manager to
 "hash".
 
 == Other notes ==
 Because this vote is occurring over a weekend, I will likely extend
 the vote if this RC survives until the end of the vote period.
 
 - Patrick
 
 

Re: Python3 and spark 1.1.0

2014-11-06 Thread Jeremy Freeman
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There 
does seem to be interest; see also this post 
(http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html).

I believe Ariel Rokem (cc'ed) has been trying to get it to work and might be 
working on a PR. It would probably be good to create a JIRA ticket for this.

— Jeremy

-
jeremyfreeman.net
@thefreemanlab

On Nov 6, 2014, at 6:01 PM, catchmonster skacan...@gmail.com wrote:

 Hi,
 I am interested in Python 3 with Spark! Simply, everything that I am developing
 in Python is happening on the Python 3 side.
 Is there a plan to integrate Spark 1.1.0 or later with Python 3?
 It seems that it is not supported in the current latest version...
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Python3-and-spark-1-1-0-tp9180.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 



Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Jeremy Freeman
Great idea! +1

— Jeremy

-
jeremyfreeman.net
@thefreemanlab

On Nov 5, 2014, at 11:48 PM, Timothy Chen tnac...@gmail.com wrote:

 Matei that makes sense, +1 (non-binding)
 
 Tim
 
 On Wed, Nov 5, 2014 at 8:46 PM, Cheng Lian lian.cs@gmail.com wrote:
 +1 since this is already the de facto model we are using.
 
 On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote:
 
 +1
 
 Sent from my iPhone
 
 On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote:
 
 +1 great idea.
 On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:
 
 +1 (binding)
 
 On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com
 wrote:
 +1 (binding)
 
 On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
 wrote:
 
 +1 on this proposal.
 
 On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com
 wrote:
 
 Will these maintainers do a cleanup of the pending PRs once we start
 to apply this model?
 
 
 I second Nan's question. I would like to see this initiative drive a
 reduction in the number of stale PRs we have out there. We're
 approaching
 300 open PRs again.
 
 Nick
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 



Re: Building and Running Spark on OS X

2014-10-20 Thread Jeremy Freeman
I also prefer sbt on Mac.

You might want to add a check for (or instructions to get) Python 2.6+ (though 
most modern Macs should have it), and maybe NumPy as an optional dependency. I 
often just point people to Anaconda.
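
Something like this quick check could go in the guide (just a sketch; the version floor reflects what PySpark currently requires, and NumPy stays optional):

    # Quick environment sanity check for PySpark on OS X
    import sys

    assert sys.version_info >= (2, 6), "PySpark requires Python 2.6+"
    try:
        import numpy
        print("numpy %s found" % numpy.__version__)
    except ImportError:
        print("numpy not found (optional, but needed for MLlib's Python API)")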

— Jeremy

-
jeremyfreeman.net
@thefreemanlab

On Oct 20, 2014, at 8:28 PM, Nicholas Chammas nicholas.cham...@gmail.com 
wrote:

 So back to my original question... :)
 
 If we wanted to post this guide to the user list or to a gist for easy
 reference, would we rather have Maven or SBT listed? And is there anything
 else about the steps that should be modified?
 
 Nick
 
 On Mon, Oct 20, 2014 at 8:25 PM, Sean Owen so...@cloudera.com wrote:
 
 Oh right, we're talking about the bundled sbt of course.
 And I didn't know Maven wasn't installed anymore!
 
 On Mon, Oct 20, 2014 at 8:20 PM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:
 The sbt executable that is in the Spark repo can be used to build Spark
 without any other setup (it will download the sbt jars, etc.).
 
 Thanks,
 Hari
 
 



Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Jeremy Freeman
Hey all, 

Definitely agreed this would be nice! In our own work we've done element-wise 
addition, subtraction, and scalar multiplication of similarly partitioned 
matrices very efficiently with zipping. We've also done matrix-matrix 
multiplication with zipping, but that only works in certain circumstances, and 
it's otherwise very communication intensive (as Shivaram says). Another tricky 
thing with addition / subtraction is how to handle sparse vs. dense arrays.
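
As a rough illustration of the zipping approach (rddA and rddB are hypothetical RDDs of NumPy row vectors with identical partitioning, which is what zip requires):

    # Element-wise sum of two row-partitioned matrices; zip pairs up rows only
    # when both RDDs have the same partitioning and per-partition counts.
    summed = rddA.zip(rddB).map(lambda rows: rows[0] + rows[1])

    # Scalar multiplication needs no zipping at all
    scaled = rddA.map(lambda row: 2.0 * row)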

Would be happy to contribute anything we did, but it's definitely first worth 
knowing what progress has been made by the AMPLab.

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Sep 5, 2014, at 12:23 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey There,
 
 I believe this is on the roadmap for the next release, 1.2, but
 Xiangrui can comment on this.
 
 - Patrick
 
 On Fri, Sep 5, 2014 at 9:18 AM, Yu Ishikawa
 yuu.ishikawa+sp...@gmail.com wrote:
 Hi Evan,
 
 That sounds interesting.
 
 Here is the ticket which I created.
 https://issues.apache.org/jira/browse/SPARK-3416
 
 thanks,
 
 
 
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8296.html
 Sent from the Apache Spark Developers List mailing list archive at 
 Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 



RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Jeremy Freeman
+1



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Jeremy Freeman
+1. Validated several custom analysis pipelines on a private cluster in
standalone mode. Tested the new PySpark support for arbitrary Hadoop input
formats; works great!
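
For context, the kind of call exercised here looks roughly like the following (the path is a placeholder, and the format/key/value classes shown are just stock Hadoop ones):

    # Reading a text file through the new-Hadoop-API input format path
    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text")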

-- Jeremy



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8143.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman
Hey RJ,

Sorry for the delay, I'd be happy to take a look at this if you can post the 
code!

I think splitting the largest cluster in each round is fairly common, but 
ideally it would be an option to do it one way or the other.

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,
 
 I wanted to follow up.
 
 I have a prototype for an optimized version of hierarchical k-means. I
 wanted to get some feedback on my approach.
 
 Jeremy's implementation splits the largest cluster in each round. Is it
 better to do it that way or to split each cluster in half?
 
 Are there any open-source examples that are being widely used in
 production?
 
 Thanks!
 
 
 
 On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:
 
 Nice to meet you, Jeremy!
 
 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.
 
 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.
 
 Maybe I can ask you to do a code review for me when I'm done?
 
 
 
 
 
 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:
 Hi all,
 
 Cool discussion! I agree that a more standardized API for clustering, and
 easy access to underlying routines, would be useful (we've also been
 discussing this when trying to develop streaming clustering algorithms,
 similar to https://github.com/apache/spark/pull/1361)
 
 For divisive, hierarchical clustering I implemented something awhile
 back,
 here's a gist.
 
 https://gist.github.com/freeman-lab/5947e7c53b368fe90371
 
 It does bisecting k-means clustering (with k=2), with a recursive class
 for
 keeping track of the tree. I also found this much better than
 agglomerative
 methods (for the reasons Hector points out).
 
 This needs to be cleaned up, and can surely be optimized (esp. by
 replacing
 the core KMeans step with existing MLLib code), but I can say I was
 running
 it successfully on quite large data sets.
 
 RJ, depending on where you are in your progress, I'd be happy to help
 work
 on this piece and / or have you use this as a jumping off point, if
 useful.
 
 -- Jeremy
 
 
 
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
 
 
 --
 em rnowl...@gmail.com
 c 954.496.2314
 
 
 
 
 -- 
 em rnowl...@gmail.com
 c 954.496.2314



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Jeremy Freeman
@Ignacio, happy to share, here's a link to a library we've been developing 
(https://github.com/freeman-lab/thunder). As just a couple of examples, we have 
pipelines that use Fourier transforms and other signal processing from SciPy, 
and others that do massively parallel model fitting via scikit-learn functions, 
etc. That should give you some idea of how such libraries can be usefully 
integrated into a PySpark project. Btw, a couple of things we do overlap with 
functionality now available in MLlib via the Python API, which we're working on 
integrating.
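
A tiny sketch of the pattern (the RDD of time series is hypothetical; the point is just that ordinary SciPy / NumPy code runs inside the transformation):

    import numpy as np
    from scipy.fftpack import fft

    # records: hypothetical RDD of (key, 1-D numpy array) pairs, one time
    # series per record; compute each series' amplitude spectrum with SciPy.
    amplitudes = records.mapValues(lambda ts: np.abs(fft(ts)))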

On Aug 13, 2014, at 5:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com 
wrote:

 Yep, I thought it was a bogus comparison.
 
 I should rephrase my question as it was poorly phrased: on average, how
 much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
 I've only used Spark and don't have a chance to test this at the moment so
 if anybody has these numbers or general estimates (10x, etc), that'd be
 great.
 
 @Jeremy, if you can discuss this, what's an example of a project you
 implemented using these libraries + PySpark?
 
 Thanks everyone!
 
 
 
 
 On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 
 On a related note, I recently heard about Distributed R
 https://github.com/vertica/DistributedR, which is coming out of
 HP/Vertica and seems to be their proposition for machine learning at scale.
 
 It would be interesting to see some kind of comparison between that and
 MLlib (and perhaps also SparkR
 https://github.com/amplab-extras/SparkR-pkg?), especially since
 Distributed R has a concept of distributed arrays and works on data
 in-memory. Docs are here: https://github.com/vertica/DistributedR/tree/master/doc/platform
 
 Nick
 
 
 On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:
 
 They only compared their own implementations of a couple of algorithms on
 different platforms rather than comparing the different platforms
 themselves (in the case of Spark -- PySpark). I can write two variants of
 an algorithm on Spark and make them perform drastically differently.
 
 I have no doubt that if you implement an ML algorithm in Python itself without
 any native libraries, the performance will be sub-optimal.
 
 What PySpark really provides is:
 
 - Using Spark transformations in Python
 - ML algorithms implemented in Scala (leveraging native numerical
 libraries
 for high performance), and callable in Python
 
 The paper claims Python is now one of the most popular languages for
 ML-oriented programming, and that's why they went ahead with Python.
 However, as I understand, very few people actually implement algorithms in
 Python directly because of the sub-optimal performance. Most people
 implement algorithms in other languages (e.g. C / Java), and expose APIs
 in
 Python for ease-of-use. This is what we are trying to do with PySpark as
 well.
 
 
 On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:
 
 Has anyone had a chance to look at this paper (with title in subject)?
 http://www.cs.rice.edu/~lp6/comparison.pdf
 
 Interesting that they chose to use Python alone. Do we know how much
 faster
 Scala is vs. Python in general, if at all?
 
 As with any and all benchmarks, I'm sure there are caveats, but it'd be
 nice to have a response to the question above for starters.
 
 Thanks,
 Ignacio
 
 
 
 



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of
anything are generally sub-optimal compared to pure Scala implementations,
or Scala versions exposed to Python (which are faster, but still slower than
pure Scala). It also seems on first glance that some of the implementations
in the paper themselves might not have been optimal (regardless of Python vs
Scala).

All that said, we have found it useful to implement some workflows purely in
Python, mainly when we want to exploit libraries like NumPy, SciPy, or
scikit-learn, or incorporate existing Python code bases, in which case the
flexibility is worth a drop in performance, at least for us! This might also
make more sense for specialized routines as opposed to core, low-level
algorithms.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/A-Comparison-of-Platforms-for-Implementing-and-Running-Very-Large-Scale-Machine-Learning-Algorithms-tp7823p7825.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Re:How to run specific sparkSQL test with maven

2014-08-01 Thread Jeremy Freeman
With Maven you can run a particular test suite like this:

mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test

see the note here (under "Spark Tests in Maven"):

http://spark.apache.org/docs/latest/building-with-maven.html



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-run-specific-sparkSQL-test-with-maven-tp7624p7626.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-19 Thread Jeremy Freeman
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put
together.

-- Jeremy



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7418.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, 

Cool discussion! I agree that a more standardized API for clustering, and
easy access to underlying routines, would be useful (we've also been
discussing this when trying to develop streaming clustering algorithms,
similar to https://github.com/apache/spark/pull/1361) 

For divisive, hierarchical clustering I implemented something awhile back,
here's a gist. 

https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for
keeping track of the tree. I also found this much better than agglomerative
methods (for the reasons Hector points out).

This needs to be cleaned up, and can surely be optimized (esp. by replacing
the core KMeans step with existing MLLib code), but I can say I was running
it successfully on quite large data sets. 
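
For anyone who wants the flavor without reading the gist, here's a rough PySpark sketch of the idea (not the gist code itself; this version splits every cluster down to a fixed depth rather than always splitting the largest, uses MLlib's KMeans for the core step, and assumes an RDD of NumPy vectors):

    import numpy as np
    from pyspark.mllib.clustering import KMeans

    def closest(point, centers):
        # index of the nearest center by squared Euclidean distance
        return int(np.argmin([np.sum((point - c) ** 2) for c in centers]))

    def bisect(rdd, depth=0, max_depth=3):
        # Recursively split each cluster in two with k-means (k=2), keeping
        # the tree as nested dicts of centers and children.
        if depth >= max_depth:
            return {"points": rdd}
        model = KMeans.train(rdd.cache(), 2, maxIterations=20)
        centers = [np.asarray(c) for c in model.clusterCenters]
        left = rdd.filter(lambda p: closest(p, centers) == 0)
        right = rdd.filter(lambda p: closest(p, centers) == 1)
        return {"centers": centers,
                "children": [bisect(left, depth + 1, max_depth),
                             bisect(right, depth + 1, max_depth)]}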

RJ, depending on where you are in your progress, I'd be happy to help work
on this piece and / or have you use this as a jumping off point, if useful. 

-- Jeremy 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.