Hello community, I would like to introduce a new Spark package that should be useful for Python users who depend on scikit-learn.
Among other tools, it lets you:

- train and evaluate multiple scikit-learn models in parallel,
- convert Spark DataFrames seamlessly into numpy arrays, and
- (experimental) distribute SciPy sparse matrices as a dataset of sparse vectors.

spark-sklearn focuses on problems that have a small amount of data and that can be run in parallel. Note that this package distributes simple tasks such as grid-search cross-validation; it does not distribute individual learning algorithms (unlike Spark MLlib). There is a short usage sketch at the end of this message.

If you want to use it, see the instructions on the package page: https://github.com/databricks/spark-sklearn

This blog post contains more details: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

Let us know if you have any questions. Documentation and code contributions are also very welcome (Apache 2.0 license).

Cheers,
Tim and Joseph
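P.S. To give a flavor of the grid-search use case mentioned above, here is a minimal sketch. It assumes a running PySpark session with an existing SparkContext named sc, and the GridSearchCV drop-in from the package; treat the exact signature and the parameter values as assumptions to check against the package docs.

    from sklearn import datasets
    from sklearn.ensemble import RandomForestClassifier
    from spark_sklearn import GridSearchCV  # drop-in for scikit-learn's GridSearchCV

    # The dataset is small and lives on the driver; only the model fits
    # (one per parameter combination) are distributed over the cluster.
    digits = datasets.load_digits()
    X, y = digits.data, digits.target

    # Illustrative parameter grid (an assumption, not from the announcement).
    param_grid = {"max_depth": [3, None],
                  "n_estimators": [10, 20, 40]}

    # Passing the SparkContext as the first argument is what spreads
    # the cross-validated fits across Spark workers.
    gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
    gs.fit(X, y)
    print(gs.best_params_)

After fitting, the searcher should behave like scikit-learn's own GridSearchCV, so the rest of a scikit-learn workflow carries over unchanged.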