Hey all, great discussion, just wanted to +1 that I see a lot of value in steps
that make it easier to use PySpark as an ordinary python library.
You might want to check out this (https://github.com/minrk/findspark), started
by Jupyter project devs, that
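For anyone curious what a tool like findspark does under the hood, here's a rough sketch (my own simplification, not findspark's actual code): it just locates a Spark install and puts Spark's bundled Python sources on sys.path so `import pyspark` works in a plain interpreter:

```python
import glob
import os
import sys


def init_pyspark(spark_home=None):
    """Roughly what findspark does: add Spark's bundled Python sources
    (pyspark and py4j) to sys.path so `import pyspark` works outside
    of bin/pyspark or spark-submit."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise ValueError("set SPARK_HOME or pass spark_home explicitly")
    # pyspark itself lives under $SPARK_HOME/python
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # the py4j bridge ships as a versioned zip under python/lib
    for zip_path in glob.glob(
            os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
        sys.path.insert(0, zip_path)
    return spark_home
```

The real library handles more cases (Homebrew layouts, editing IPython profiles, etc.), so treat this as the gist only.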
+1 for this feature
In our use case, we probably wouldn’t use this feature in production, but it
can be useful during prototyping and algorithm development to repeatedly
perform the same streaming operation on a fixed, already existing set of files.
-
jeremyfreeman.net
For topic #4 (streaming ML in Python), there’s an existing JIRA, but progress
seems to have stalled. I’d be happy to help if you want to pick it up!
https://issues.apache.org/jira/browse/SPARK-4127
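For context, the heart of that JIRA (streaming linear regression) is just a per-mini-batch gradient step on the model weights. A toy pure-Python sketch of the idea, not the MLlib API:

```python
def sgd_batch_update(weights, batch, step=0.1):
    """One gradient step on a mini-batch, the update at the core of
    streaming linear regression: each incoming batch nudges the weights.
    `batch` is a list of (features, label) pairs; features is a list
    matching len(weights)."""
    n = len(batch)
    grad = [0.0] * len(weights)
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for j, xi in enumerate(x):
            grad[j] += err * xi
    # average the gradient over the batch and take one step
    return [w - step * g / n for w, g in zip(weights, grad)]
```

In the streaming version this update would be applied inside a DStream transformation on each batch interval; the hard part of the JIRA is the Python/Scala plumbing, not the math.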
-
jeremyfreeman.net
@thefreemanlab
On Feb 26, 2015, at 4:20 PM, Xiangrui
Hi Stephen, it should be enough to include
--jars /path/to/file.jar
in the command line call to either pyspark or spark-submit, as in
spark-submit --master local --jars /path/to/file.jar myfile.py
and you can check the bottom of the Web UI’s “Environment” tab to make sure the
jar gets on the classpath.
+1 (non-binding)
Installed version pre-built for Hadoop on a private HPC
ran PySpark shell w/ IPython
loaded data using custom Hadoop input formats
ran MLlib routines in PySpark
ran custom workflows in PySpark
browsed the web UI
Noticeable improvements in stability and performance during large
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There
does seem to be interest, see also this post
(http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html).
I believe Ariel Rokem (cced) has been trying to get it to work and might be
working
Great idea! +1
— Jeremy
-
jeremyfreeman.net
@thefreemanlab
On Nov 5, 2014, at 11:48 PM, Timothy Chen tnac...@gmail.com wrote:
Matei that makes sense, +1 (non-binding)
Tim
On Wed, Nov 5, 2014 at 8:46 PM, Cheng Lian lian.cs@gmail.com wrote:
+1 since this is
I also prefer sbt on Mac.
You might want to add checking for / getting Python 2.6+ (though most modern
Macs should have it), and maybe numpy as an optional dependency. I often just
point people to Anaconda.
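As a sketch of the check I have in mind (a hypothetical helper, just illustrating the 2.6+ requirement and numpy as optional):

```python
import sys


def check_python_env(min_version=(2, 6)):
    """Hypothetical install-time check: require Python >= min_version
    and report whether numpy (optional, used by MLlib) is available."""
    if sys.version_info[:2] < min_version:
        raise RuntimeError(
            "Python %d.%d or newer required, found %d.%d"
            % (min_version + sys.version_info[:2]))
    try:
        import numpy  # noqa: F401 -- optional dependency
        has_numpy = True
    except ImportError:
        has_numpy = False
    return has_numpy
```

Anaconda sidesteps all of this, which is why it's often the easiest recommendation.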
— Jeremy
-
jeremyfreeman.net
@thefreemanlab
On Oct 20, 2014,
-- Jeremy
-
jeremy freeman, phd
neuroscientist
@thefreemanlab
On Sep 5, 2014, at 12:23 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey There,
I believe this is on the roadmap for the next release, 1.2. But
Xiangrui can comment on this.
- Patrick
On Fri, Sep 5
+1
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
+1. Validated several custom analysis pipelines on a private cluster in
standalone mode. Tested new PySpark support for arbitrary Hadoop input
formats, works great!
-- Jeremy
Hey RJ,
Sorry for the delay, I'd be happy to take a look at this if you can post the
code!
I think splitting the largest cluster in each round is fairly common, but
ideally it would be an option to do it one way or the other.
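To make the "split the largest" convention concrete, a small sketch (a hypothetical helper, not the code in question):

```python
from collections import Counter


def pick_cluster_to_split(assignments):
    """Divisive clustering, one common convention: at each round,
    split the cluster that currently holds the most points.
    `assignments` maps point id -> cluster id."""
    counts = Counter(assignments.values())
    # most_common(1) returns [(cluster_id, size)] for the biggest cluster
    return counts.most_common(1)[0][0]
```

Making it an option would just mean swapping in a different criterion here, e.g. picking the cluster with the largest within-cluster variance instead of the largest size.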
-- Jeremy
-
jeremy freeman, phd
neuroscientist
@Ignacio, happy to share, here's a link to a library we've been developing
(https://github.com/freeman-lab/thunder). As just a couple examples, we have
pipelines that use fourier transforms and other signal processing from scipy,
and others that do massively parallel model fitting via Scikit
Our experience matches Reynold's comments; pure-Python implementations of
anything are generally sub-optimal compared to pure Scala implementations,
or Scala versions exposed to Python (which are faster, but still slower than
pure Scala). It also seems on first glance that some of the
With maven you can run a particular test suite like this:
mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test
see the note here (under Spark Tests in Maven):
http://spark.apache.org/docs/latest/building-with-maven.html
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put
together.
-- Jeremy
Hi all,
Cool discussion! I agree that a more standardized API for clustering, and
easy access to underlying routines, would be useful (we've also been
discussing this when trying to develop streaming clustering algorithms,
similar to https://github.com/apache/spark/pull/1361)
For divisive,