Re: [GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...

2014-03-06 Thread Holden Karau
Sure, unique from MLI-2? On Thu, Mar 6, 2014 at 2:15 PM, mengxr g...@git.apache.org wrote: Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/18#issuecomment-36944266 LGTM, except the extra empty line. Do you mind creating a Spark JIRA for this

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-27 Thread Holden Karau
+1 (I did some very basic testing with PySpark Pandas on rc11) On Tue, May 27, 2014 at 3:53 PM, Mark Hamstra m...@clearstorydata.comwrote: +1 On Tue, May 27, 2014 at 9:26 AM, Ankur Dave ankurd...@gmail.com wrote: 0 OK, I withdraw my downvote. Ankur http://www.ankurdave.com/

Re: Easy win: SBT plugin config expert to help on SPARK-3359?

2014-10-22 Thread Holden Karau
Hi Sean, I've pushed a PR for this https://github.com/apache/spark/pull/2893 :) Cheers, Holden :) On Tue, Oct 21, 2014 at 4:41 AM, Sean Owen so...@cloudera.com wrote: This one can be resolved, I think, with a bit of help from someone who understands SBT + plugin config:

Re: Development testing code

2014-10-22 Thread Holden Karau
Hi, Many tests in PySpark are implemented as doctests, and the Python unittest framework is also used for additional tests. Cheers, Holden :) On Wed, Oct 22, 2014 at 4:13 PM, catchmonster skacan...@gmail.com wrote: Hi, If developing in python, what is the preferred way to do unit testing? Do
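For anyone new to those two styles, here is a small illustrative sketch (not taken from the Spark test suite; the helper function and its logic are made up) showing a doctest alongside a unittest-based test:

```python
import unittest
from pyspark import SparkContext


def double_values(rdd):
    """Double every element of an RDD.

    >>> sc = SparkContext.getOrCreate()
    >>> sorted(double_values(sc.parallelize([1, 2, 3])).collect())
    [2, 4, 6]
    """
    return rdd.map(lambda x: x * 2)


class DoubleValuesTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext.getOrCreate()

    def test_double(self):
        result = double_values(self.sc.parallelize([1, 2, 3])).collect()
        self.assertEqual(sorted(result), [2, 4, 6])


if __name__ == "__main__":
    import doctest
    doctest.testmod()   # runs the docstring example above
    unittest.main()     # runs the unittest-style test
```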

Re: Adding Spark Testing functionality

2015-10-12 Thread Holden Karau
So here is a quick description of the current testing bits (I can expand on it if people are interested) http://bit.ly/pandaPandaPanda . On Tue, Oct 6, 2015 at 3:49 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > I'll put together a google doc and send that out (in the meantime a quic

Re: Live UI

2015-10-12 Thread Holden Karau
I don't think there has been much work done with ScalaJS and Spark (outside of the April fools press release), but there is a live Web UI project out of hammerlab with Ryan Williams https://github.com/hammerlab/spree which you may want to take a look at. On Mon, Oct 12, 2015 at 2:36 PM, Jakob

Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
Hi Spark Devs, So this has been brought up a few times before, and generally on the user list people get directed to use spark-testing-base. I'd like to start moving some of spark-testing-base's functionality into Spark so that people don't need a library to do what is (hopefully :p) a very
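As a rough illustration of the kind of boilerplate under discussion (a sketch, not the API spark-testing-base or this proposal actually defines), a shared-SparkContext fixture might look like:

```python
import unittest
from pyspark import SparkConf, SparkContext


class SparkTestCase(unittest.TestCase):
    """One local SparkContext per test class -- the sort of helper a testing base could provide."""

    @classmethod
    def setUpClass(cls):
        conf = SparkConf().setMaster("local[2]").setAppName("unit-tests")
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()


class WordCountTest(SparkTestCase):
    def test_count_by_value(self):
        counts = self.sc.parallelize(["a", "b", "a"]).countByValue()
        self.assertEqual(counts["a"], 2)
```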

Re: Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
some high level view of the functionality you imagine > would be helpful to give more detailed feedback. > > - Patrick > > On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > >> Hi Spark Devs, >> >> So this has been brought up a few t

Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Holden Karau
+1 - compiled on Ubuntu and CentOS; spark-perf run against YARN in client mode on a small cluster comparing 1.4.0 and 1.4.1 (for core) doesn't show any huge jumps (albeit with a small scaling factor). On Wed, Jul 8, 2015 at 11:58 PM, Patrick Wendell pwend...@gmail.com wrote: +1 On Wed, Jul 8, 2015

Building with sbt impossible to get artifacts when data has not been loaded

2015-08-26 Thread Holden Karau
Has anyone else run into "impossible to get artifacts when data has not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3" during hive/update when building with sbt? Working around it is pretty simple (just add it as a dependency), but I'm wondering if it's impacting anyone else and I should

Re: [ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Holden Karau
I'd be more than happy to help review the docs if that would be useful :) On Sat, Dec 5, 2015 at 2:21 PM, Joseph Bradley wrote: > Thanks for reporting this! I just added a JIRA: > https://issues.apache.org/jira/browse/SPARK-12159 > That would be great if you could send a
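For reference while the docs are being written, a small usage sketch of the transformer in question (assumes a DataFrame df with a string "category" column; the column names are made up):

```python
from pyspark.ml.feature import IndexToString, StringIndexer

# Index string labels, then map the indices back to the original strings.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = indexer.fit(df)          # df is assumed to exist with a string "category" column
indexed = model.transform(df)

converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory",
                          labels=model.labels)
converted = converter.transform(indexed)
```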

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I'd do some searching and see if there is a JIRA related to this problem on s3 and if you don't find one go ahead and make one. Even if it is an intrinsic problem with s3 (and I'm not super sure since I'm just reading this on mobile) - it would maybe be a good thing for us to document. On

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I think your error could possibly be different - looking at the original JIRA the issue was happening on HDFS and you seem to be experiencing the issue on s3n, and while I don't have full view of the problem I could see this being s3 specific (read-after-write on s3 is trickier than

Re: ImportError: No module named numpy

2016-06-01 Thread Holden Karau
Generally this means numpy isn't installed on the system or your PYTHONPATH has somehow gotten pointed somewhere odd. On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra wrote: > If any one please can help me with following error. > > File >
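A quick way to narrow down which environment is missing numpy (assumes an active SparkContext named sc, e.g. from the pyspark shell):

```python
import sys

# Which interpreter and search path is the driver actually using?
print(sys.executable)
print(sys.path)

import numpy                      # fails here if the driver environment lacks numpy
print(numpy.__version__)

# Check the executors too -- they may be running a different Python than the driver.
def worker_numpy_version(_):
    import numpy
    return numpy.__version__

print(sc.parallelize(range(2), 2).map(worker_numpy_version).collect())
```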

Re: Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Holden Karau
PySpark RDDs are (on the Java side) essentially RDDs of pickled objects and mostly (but not entirely) opaque to the JVM. It is possible (by using some internals) to pass a PySpark DataFrame to a Scala library (you may or may not find the talk I gave at Spark Summit useful
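The "some internals" part usually means going through the DataFrame's underlying JVM object; a hedged sketch, where com.example.MyScalaLib.process is a hypothetical Scala entry point and _jdf/_jvm are private attributes that can change between releases:

```python
from pyspark.sql import DataFrame


def call_scala_library(df, spark):
    # Private internals: the JVM-side DataFrame and a gateway into the JVM.
    jdf = df._jdf
    jvm = spark._jvm

    # Hypothetical Scala object that takes and returns a JVM DataFrame.
    result_jdf = jvm.com.example.MyScalaLib.process(jdf)

    # Wrap the JVM result back into a Python DataFrame.
    return DataFrame(result_jdf, df.sql_ctx)
```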

Re: How to run PySpark tests?

2016-02-18 Thread Holden Karau
I've run into some problems with the Python tests in the past when I haven't built with hive support, you might want to build your assembly with hive support and see if that helps. On Thursday, February 18, 2016, Jason White wrote: > Hi, > > I'm trying to finish up a PR

Re: How to run PySpark tests?

2016-02-18 Thread Holden Karau
Great - I'll update the wiki. On Thu, Feb 18, 2016 at 8:34 PM, Jason White wrote: > Compiling with `build/mvn -Pyarn -Phadoop-2.4 -Phive -Dhadoop.version=2.4.0 > -DskipTests clean package` followed by `python/run-tests` seemed to do the > trick! Thanks! > > > > -- >

Re: Write access to wiki

2016-02-19 Thread Holden Karau
Any chance I could also get write access to the wiki? I'd like to update some of the PySpark documentation in the wiki. On Tue, Jan 12, 2016 at 10:14 AM, shane knapp wrote: > > Ok, sounds good. I think it would be great, if you could add installing > the > > 'docker-engine'

Re: How to run PySpark tests?

2016-02-19 Thread Holden Karau
Or wait I don't have access to the wiki - if anyone can give me wiki access I'll update the instructions. On Thu, Feb 18, 2016 at 8:45 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > Great - I'll update the wiki. > > On Thu, Feb 18, 2016 at 8:34 PM, Jason White <jason.wh...@sh

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-05 Thread Holden Karau
One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is deprecation warnings in our builds that we can't fix without introducing a wrapper or Scala-version-specific code. This isn't a big deal, and if we drop 2.10 in the 3-6 month time frame talked about we can clean up those

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Holden Karau
I'm very much in favor of this, the less porting work there is the better :) On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley wrote: > +1 By the way, the JIRA for tracking (Scala) API parity is: > https://issues.apache.org/jira/browse/SPARK-4591 > > On Tue, Apr 5, 2016 at

Re: persist versus checkpoint

2016-04-30 Thread Holden Karau
They are different, also this might be better suited for the user list. Persist by default will cache in memory on one machine, although you can specify a different storage level. Checkpoint on the other hand will write out to a persistent store and get rid of the dependency graph used to compute
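A small sketch of the difference (paths and data are arbitrary; assumes an active SparkContext named sc):

```python
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# persist: keeps the blocks (in memory here), but the lineage is retained,
# so lost partitions can be recomputed from the original data.
rdd.persist(StorageLevel.MEMORY_ONLY)

# checkpoint: writes the data out to reliable storage and truncates the lineage.
sc.setCheckpointDir("/tmp/spark-checkpoints")   # any reliable path (e.g. on HDFS) works
rdd.checkpoint()

rdd.count()   # an action triggers both the caching and the checkpoint
```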

Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Holden Karau
+1 non-binding (as a contributor anything which speed things up is worth a try, and git blame is a good enough substitute for the list when figuring out who to ping on a PR). On Monday, May 23, 2016, Imran Rashid wrote: > +1 (binding) > > On Mon, May 23, 2016 at 8:13 AM,

PySpark mixed with Jython

2016-05-15 Thread Holden Karau
I've been doing some looking at EclairJS (Spark + JavaScript) which takes a really interesting approach. The driver program is run in Node and the workers are run in Nashorn. I was wondering if anyone has given much thought to optionally exposing an interface for PySpark in a similar fashion. For

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Holden Karau
Personally I'd rather err on the side of keeping PRs open, but I understand wanting to keep the open PRs limited to ones which have a reasonable chance of being merged. What about if we filtered for non-mergeable PRs or instead left a comment asking the author to respond if they are still

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Holden Karau
+1 (non-binding) Built locally on Ubuntu 14.04, basic pyspark sanity checking & tested with a simple structured streaming project (spark-structured-streaming-ml) & spark-testing-base & high-performance-spark-examples (minor changes required from preview version but seem intentional & jetty

Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Holden Karau
Now that the 2.0 release is out the door and I've got some cycles to do some cleanups - I'd like to know what other people think of the internal deprecation warnings we've introduced in a lot of places in our code. Once before I did some minor refactoring so the Python code which had to use the

Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Holden Karau
be called from tests to still test the deprecated code but it > ought to be possible to make the non-test code avoid it entirely. > > On Wed, Jul 27, 2016 at 12:11 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > > Now that the 2.0 release is out the door and I've got some

Re: How do a new developer create or assign a jira ticket?

2016-07-27 Thread Holden Karau
Hi Neil, Thanks for your interest in participating in Apache Spark! You can create JIRAs - but first you will need to sign up for an Apache JIRA account. Generally we can't assign JIRAs to ourselves - but you can leave a comment saying you're interested in working. I think for R a good place to get

Re: AccumulatorV2 += operator

2016-08-02 Thread Holden Karau
I believe it was intentional with the idea that it would be more unified between Java and Scala APIs. If you're talking about the javadoc mention in https://github.com/apache/spark/pull/14466/files - I believe the += is meant to refer to what the internal implementation of the add function can be
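On the Python side both spellings work today, as the programming-guide text quoted in the follow-up notes; a tiny sketch (assumes an active SparkContext named sc):

```python
acc = sc.accumulator(0)

# Tasks can use the add() method...
sc.parallelize(range(10)).foreach(lambda x: acc.add(1))

# ...and in Python the += operator also works, since Accumulator supports in-place addition.
acc += 5

print(acc.value)   # 15 -- the accumulated value is only readable on the driver
```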

Re: AccumulatorV2 += operator

2016-08-03 Thread Holden Karau
, August 3, 2016, Bryan Cutler <cutl...@gmail.com> wrote: > No, I was referring to the programming guide section on accumulators, it > says " Tasks running on a cluster can then add to it using the add method > or the += operator (in Scala and Python)." > > On Aug

Re: Spark Homepage

2016-07-13 Thread Holden Karau
This has also been reported on the user@ by a few people - other Apache projects (Arrow & Hadoop) don't seem to be affected so maybe it was just a bad update for the Spark website? On Wed, Jul 13, 2016 at 12:05 PM, Dongjoon Hyun wrote: > Hi, All. > > Currently, Spark

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Holden Karau
endell/spark-releases/spark-2.0.0-rc4-docs-updated/ > > On Tue, Jul 19, 2016 at 3:19 PM Holden Karau <hol...@pigscanfly.ca> wrote: > >> -1 : The docs don't seem to be fully built (e.g. >> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/streami

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Holden Karau
-1 : The docs don't seem to be fully built (e.g. http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/streaming-programming-guide.html is a zero byte file currently) - although if this is a transient apache issue no worries. On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin

Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Holden Karau
Looking at the Sink in 2.0 there is a warning (added in SPARK-16020 without a lot of details) that says "Note: You cannot apply any operators on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering if this restriction is perhaps too broadly worded? Provided that we consume the

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Holden Karau
ult. Since this was discovered late > in the release process we decided it was better to document the current > behavior, rather than do a large refactoring. > > On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <hol...@pigscanfly.ca > <javascript:_e(%7B%7D,'cvml','hol...@

[PySPARK] - Py4J binary transfer survey

2016-07-06 Thread Holden Karau
Hi PySpark Devs, The Py4j developer has a survey up for Py4J users - https://github.com/bartdag/py4j/issues/237 it might be worth our time to provide some input on how we are using and would like to be using Py4J if binary transfer was improved. I'm happy to fill it out with my thoughts - but if

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
Spark does not currently support Apache Arrow - probably a good place to chat would be on the Arrow mailing list where they are making progress towards unified JVM & Python/R support which is sort of a precondition of a functioning Arrow interface between Spark and Python. On Fri, Aug 5, 2016 at

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
t; will support Arrow? I'd just like to know that all the pieces will come > together eventually. > > (In this forum, most of the discussion about Arrow is about PySpark and > Pandas, not Spark in general.) > > Best, > Jim > > On Aug 5, 2016 2:43 PM, "Holden Karau" &

Re: Google Summer of Code 2017 is coming

2017-02-03 Thread Holden Karau
As someone who did GSoC back in University I think this could be a good idea if there is enough interest from the PMC & I'd be willing to help mentor if that is a bottleneck. On Fri, Feb 3, 2017 at 12:42 PM, Jacek Laskowski wrote: > Hi, > > Is this something Spark considering?

Re: Is there any plan to have a predict method for single instance on PipelineModel?

2017-02-05 Thread Holden Karau
I'm on mobile right now but there is a JIRA to add it to the models first and on that JIRA people are discussing single element transform as a possibility - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10413 There might be others as well that just aren't as fresh in my

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Holden Karau
Also thanks everyone :) Looking forward to helping out (and if anyone wants to get started contributing to PySpark please ping me :)) On Tue, Jan 24, 2017 at 3:24 PM, Burak Yavuz wrote: > Thank you very much everyone! Hoping to help out the community as much as > I can! > >

Re: Design document - MLlib's statistical package for DataFrames

2017-02-18 Thread Holden Karau
It's at the bottom of every message (although some mail clients hide it for some reason), send an email to dev-unsubscr...@spark.apache.org On Sat, Feb 18, 2017 at 11:07 AM Pritish Nawlakhe < prit...@nirvana-international.com> wrote: > Hi > > Would anyone know how to unsubscribe to this list? >

[PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Holden Karau
Hi PySpark Developers, Cloudpickle is a core part of PySpark, and is originally copied from (and improved from) picloud. Since then other projects have found cloudpickle useful and a fork of cloudpickle was created and is now maintained as its own

Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Holden Karau
lways ask > this question: what's the benefit? In this case it looks like the benefit > is to reduce efforts in backports. Do you know how often we needed to do > those? > > > On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <hol...@pigscanfly.ca> > wrote: > >> Hi Py

Re: Persisting PySpark ML Pipelines that include custom Transformers

2016-08-19 Thread Holden Karau
I don't think we've given a lot of thought to model persistence for custom Python models yet - if the Python model is wrapping a JVM model, using JavaMLWritable along with '_to_java' should work provided your Java model is already saveable. On the other hand - if your model isn't wrapping a
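A very rough sketch of the "wrapping a JVM model" case (this mirrors the mixin pattern pyspark.ml's own wrappers use; it is not a supported public recipe, and it assumes the underlying Java/Scala model class is itself saveable):

```python
from pyspark.ml.util import JavaMLWritable
from pyspark.ml.wrapper import JavaModel


class MyWrappedModel(JavaModel, JavaMLWritable):
    """Sketch only: a Python wrapper around a hypothetical saveable JVM model.

    JavaMLWritable delegates save() to the wrapped Java object, so persistence
    rides on the JVM implementation. Loading back into Python needs a matching
    reader and extra wiring not shown here.
    """
```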

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Holden Karau
I'm seeing some test failures with Python 3 that could definitely be environmental (going to rebuild my virtual env and double check), I'm just wondering if other people are also running the Python tests on this release or if everyone is focused on the Scala tests? On Mon, Sep 26, 2016 at 11:48

StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-09-26 Thread Holden Karau
Hi Spark Developers, After some discussion on SPARK-16407 (and on the PR) we've decided to jump back to the developer list (SPARK-16407 itself comes

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Holden Karau
Congratulations :D :) Yay! On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati < suresh.thalam...@gmail.com> wrote: > Congratulations, Xiao! > > > > > On Oct 3, 2016, at 10:46 PM, Reynold Xin wrote: > > > > Hi all, > > > > Xiao Li, aka gatorsmile, has recently been elected as

Re: Spark Improvement Proposals

2016-10-07 Thread Holden Karau
First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion. I think dismissing people's complaints with Spark as largely trolls does us a disservice, it’s important for us to recognize our own shortcomings - otherwise we are

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Holden Karau
We could certainly do that system - but given the current somewhat small set of active committers its clearly not scaling very well. There are many developers in Spark like Hyukjin, Cody, and myself who care about specific areas and can verify if an issue is still present in mainline. That being

PySpark UDF Performance Exploration w/Jython (Early/rough 2~3X improvement*) [SPARK-15369]

2016-10-05 Thread Holden Karau
Hi Python Spark Developers & Users, As Datasets/DataFrames are becoming the core building block of Spark, and as someone who cares about Python Spark performance, I've been looking more at PySpark UDF performance. I've got an early WIP/request for comments pull request open

Early Draft Structured Streaming Machine Learning

2016-08-18 Thread Holden Karau
Hi Everyone (that cares about structured streaming and ML), Seth and I have been giving some thought to supporting structured streaming in machine learning - we've put together an early design doc (it's been in JIRA (SPARK-16424) for a while, but

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Holden Karau
test >> code. Not running the tests from the distribution. >> Cheers >> >> >> On Mon, Sep 26, 2016 at 11:59 AM, Holden Karau <hol...@pigscanfly.ca> >> wrote: >> >>> I'm seeing some test failures with Python 3 that could definitely be >>>

Re: Using mention-bot to automatically ping potential reviewers

2016-11-06 Thread Holden Karau
So according to the documentation it mostly uses blame lines which _might_ not be the best fit for Spark (since many of the people in the blame lines aren't going to have permission to commit the code). (Although it's possible that the algorithm that is actually used does more than the one described

Re: Handling questions in the mailing lists

2016-11-10 Thread Holden Karau
That's a good question, looking at http://stackoverflow.com/tags/apache-spark/topusers shows a few contributors who have already been active on SO including some committers and PMC members with very high overall SO reputations for any administrative needs (as well as a number of other

Re: Contributing to PySpark

2016-10-18 Thread Holden Karau
Hi Krishna, Thanks for your interest contributing to PySpark! I don't personally use either of those IDEs so I'll leave that part for someone else to answer - but in general you can find the building spark documentation at http://spark.apache.org/docs/latest/building-spark.html which includes

Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-18 Thread Holden Karau
Right now the wiki isn't particularly accessible to updates by external contributors. We've already got a contributing to spark page which just links to the wiki - how about if we just move the wiki contents over? This way contributors can contribute to our documentation about how to contribute

Re: On convenience methods

2016-10-18 Thread Holden Karau
I think what Reynold means is that if it's easy for a developer to build this convenience function using the current Spark API it probably doesn't need to go into Spark unless it's being done to provide a similar API to a system we are attempting to be semi-compatible with (e.g. if a corresponding

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Holden Karau
I'd also like to add Python 2.6 to the list of things. We've considered dropping it before but never followed through to the best of my knowledge (although on mobile right now so can't double check). On Tuesday, October 25, 2016, Sean Owen wrote: > I'd like to gauge where

Re: Spark Wiki now migrated to spark.apache.org

2016-11-23 Thread Holden Karau
That's awesome thanks for doing the migration :) On Wed, Nov 23, 2016 at 3:29 AM Sean Owen wrote: > I completed the migration. You can see the results live right now at > http://spark.apache.org, and > https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage > > A

Re: issues with github pull request notification emails missing

2016-11-16 Thread Holden Karau
+1 it seems like I'm missing a number of my GitHub email notifications lately (although since I run my own mail server and forward I've been assuming it's my own fault). I've also had issues with having greatly delayed notifications on some of my own pull requests but that might be unrelated.

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Holden Karau
I've been working on a blog post around this and hope to have it published early next month  On Nov 17, 2016 10:16 PM, "Joseph Bradley" wrote: Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for
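While that blog post is pending, a minimal custom Transformer in PySpark can look roughly like the sketch below (the column names and upper-casing logic are made up; parameter validation and persistence are omitted):

```python
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F


class UpperCaser(Transformer, HasInputCol, HasOutputCol):
    """Toy transformer: copies inputCol to outputCol, upper-cased."""

    def __init__(self, inputCol, outputCol):
        super(UpperCaser, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        # Transformer.transform() delegates here after handling params.
        return dataset.withColumn(self.getOutputCol(),
                                  F.upper(F.col(self.getInputCol())))


# Usage: UpperCaser(inputCol="name", outputCol="name_upper").transform(df)
```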

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-31 Thread Holden Karau
I believe Bryan is also working on this a little - and I'm a little busy with the other stuff but would love to stay in the loop on Arrow progress :) On Monday, October 31, 2016, mariusvniekerk wrote: > So i've been working on some very very early stage apache arrow

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-11-01 Thread Holden Karau
On that note there is some discussion on the Jira - https://issues.apache.org/jira/browse/SPARK-13534 :) On Mon, Oct 31, 2016 at 8:32 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > I believe Bryan is also working on this a little - and I'm a little busy > with the other stuff bu

Blocked PySpark changes

2016-11-02 Thread Holden Karau
Hi Spark Developers & Maintainers, I know we've been talking a lot about what we want changes we want in PySpark to help keep it interesting and usable (see http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html).

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Holden Karau
This is a thing I often have people ask me about, and then I do my best to dissuade them from using Spark in the "hot path" and it's normally something which most people eventually accept. Fred might have more information for people for whom this is a hard requirement though. On Thursday, October

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful and I'd also completely forgotten about the lack of built in UDAF support which is rather important. There is a PR to make it easier to call/register JVM UDFs from Python which will hopefully help a bit there too. I'm getting

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-10 Thread Holden Karau
I think it is really important to ensure that someone with a good understanding of Kafka is empowered around this component with a formal voice around - but I don't have much dev experience with our Kafka connectors so I can't speak to the specifics around it personally. More generally, I also

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Holden Karau
Thanks for the specific mention of the new PySpark packaging, Shivaram. For *nix (Linux, Unix, OS X, etc.) Python users interested in helping test the new artifacts you can do as follows: Set up PySpark with pip by: 1. Download the artifact from
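Once the artifact is pip-installed into a clean virtualenv, a quick smoke test might look like the following (the data and app name are arbitrary):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("pip-install-smoke-test")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
assert df.count() == 3
assert df.filter(df.id > 1).count() == 2

spark.stop()
```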

Re: [PYSPARK] Python tests organization

2017-01-12 Thread Holden Karau
I'd be happy to help with reviewing Python test improvements. Maybe make an umbrella JIRA and do one sub-component at a time? On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal wrote: > Following up, any thoughts on next steps for this?

Re: Can I add a new method to RDD class?

2016-12-05 Thread Holden Karau
Doing that requires publishing a custom version of Spark; you can edit the version number and do a publishLocal - but maintaining that change is going to be difficult. The other approaches suggested are probably better, but also does your method need to be defined on the RDD class? Could you
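On that last point, the design alternative (keeping the method off the core RDD class) can be sketched in PySpark as a plain helper that takes the RDD as an argument; in Scala an implicit/extension class gives the same effect without forking Spark. The helper below is hypothetical:

```python
def count_distinct_evens(rdd):
    """Hypothetical 'would-be RDD method' expressed as a plain helper function."""
    return rdd.filter(lambda x: x % 2 == 0).distinct().count()


# Usage, assuming an active SparkContext named sc:
# count_distinct_evens(sc.parallelize([1, 2, 2, 4, 5]))   # -> 2
```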

Re: scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Holden Karau
If you want to check if it's your modifications or just in mainline, you can always just checkout mainline or stash your current changes to rebuild (this is something I do pretty often when I run into bugs I don't think I would have introduced). On Mon, Jan 9, 2017 at 1:01 AM Liang-Chi Hsieh

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Holden Karau
Hi Gilad, Spark uses the sample standard deviation inside of the StandardScaler (see https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler ) which I think would explain the results you are seeing. I believe the scalers are intended to
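The difference in question is just the n versus n-1 denominator; a quick check with numpy (arbitrary data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

population_std = np.std(x, ddof=0)   # divides by n
sample_std = np.std(x, ddof=1)       # divides by n - 1 (the corrected sample estimate)

print(population_std, sample_std)    # ~1.118 vs ~1.291
```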

Re: handling of empty partitions

2017-01-08 Thread Holden Karau
Hi Georg, Thanks for the question along with the code (as well as posting to Stack Overflow). In general if a question is well suited for Stack Overflow it's probably better suited to the user@ list instead of the dev@ list, so I've cc'd the user@ list for you. As far as handling empty partitions
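On the empty-partition handling itself, a common pattern is to peek at the iterator inside mapPartitions and return early; a sketch, where process_record stands in for whatever per-record logic is needed:

```python
import itertools


def handle_partition(records):
    first = next(records, None)
    if first is None:
        return iter([])                       # empty partition: emit nothing
    # Put the peeked element back in front and process the whole partition.
    # process_record is assumed to be defined elsewhere.
    return (process_record(r) for r in itertools.chain([first], records))


# Usage: rdd.mapPartitions(handle_partition)
```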

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Holden Karau
+1 (non-binding) - checked Python artifacts with virtual env. On Sun, Dec 18, 2016 at 11:42 AM Denny Lee wrote: > +1 (non-binding) > > > On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin wrote: > > +1 > > Cheers, > Liwei > > > > On Sat, Dec 17, 2016 at 10:29

Re: Outstanding Spark 2.1.1 issues

2017-03-30 Thread Holden Karau
t; the release process. Will post RC1 once I get it working. >>>> >>>> On Tue, Mar 21, 2017 at 2:16 PM, Nick Pentreath < >>>> nick.pentre...@gmail.com> wrote: >>>> >>>>> As for SPARK-19759 <https://issues.apache.org/jira/browse/SPARK-

[Important for PySpark Devs]: Master now tests with Python 2.7 rather than 2.6 - please retest any Python PRs

2017-03-29 Thread Holden Karau
Hi PySpark Developers, In https://issues.apache.org/jira/browse/SPARK-19955 / https://github.com/apache/spark/pull/17355, as part of our continued Python 2.6 deprecation https://issues.apache.org/jira/browse/SPARK-15902 & eventual removal https://issues.apache.org/jira/browse/SPARK-12661 ,

Re: Should we consider a Spark 2.1.1 release?

2017-03-19 Thread Holden Karau
is a good idea. > > > > On Sun, Mar 19, 2017 at 6:24 AM, Jacek Laskowski <ja...@japila.pl> > wrote: > >> > >> +1 > >> > >> More smaller and more frequent releases (so major releases get even more > >> quality). > &

Re: Should we consider a Spark 2.1.1 release?

2017-03-20 Thread Holden Karau
stable versions not frequent ones. > A lot of people still have 1.6.x in production. Those who want the > freshest (like me) can always deploy nightly builds. > My question is: how long version 1.6 will be supported? > > > On Sunday, March 19, 2017, Holden Karau <hol...@pigsca

Outstanding Spark 2.1.1 issues

2017-03-20 Thread Holden Karau
Hi Spark Developers! As we start working on the Spark 2.1.1 release I've been looking at our outstanding issues still targeted for it. I've tried to break it down by component so that people in charge of each component can take a quick look and see if any of these things can/should be re-targeted

Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Holden Karau
RK-19925 > > > > > -- > *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of > Holden Karau <hol...@pigscanfly.ca> > *Sent:* Monday, March 20, 2017 3:12:35 PM > *To:* dev@spark.apache.org > *Subject:* Outstanding Spark

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Holden Karau
hese seem like critical > regressions from 2.1. As such I'll start the RC process later today. > > On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > >> I'm not super sure it should be a blocker for 2.1.1 -- is it a >> regression? Maybe we

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Holden Karau
ail.com> > wrote: > > -1 > sorry, found an issue with SparkR CRAN check. > Opened SPARK-20197 and working on fix. > > -- > *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of > Holden Karau <hol...@pigscanfly.ca>

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Holden Karau
See SPARK-20216, if Michael can let me know which machine is being used for packaging I can see if I can install pandoc on it (should be simple but I know the Jenkins cluster is a bit on the older side). On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > S

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-03-31 Thread Holden Karau
-1 (non-binding) Python packaging doesn't seem to have quite worked out (looking at PKG-INFO the description is "Description: ! missing pandoc do not upload to PyPI "), ideally it would be nice to have this as a version we upload to PyPI. Building this on my own machine results in a

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-05 Thread Holden Karau
Following up, the issues with missing pypandoc/pandoc on the packaging machine have been resolved. On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > See SPARK-20216, if Michael can let me know which machine is being used > for packaging I can see if I can in

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Holden Karau
r 13, 2017 at 9:39 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > >> If it would help I'd be more than happy to look at kicking off the >> packaging for RC3 since I'v been poking around in Jenkins a bit (for >> SPARK-20216 >> & friends) (I'd still proba

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Holden Karau
At first glance the error seems similar to one Pedro Rodriguez ran into during 2.0, so I'm looping Pedro in if they happen to have any insight into what was the cause last time. On Fri, Apr 14, 2017 at 4:40 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > Sure, let me dig into it :) &

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Holden Karau
with JDK7. What do people think? On Fri, Apr 14, 2017 at 4:53 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > At first glance the error seems similar to one Pedro Rodriguez ran into > during 2.0, so I'm looping Pedor in if they happen to have any insight into > what was the

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-13 Thread Holden Karau
critical bug that na.fill will mess up the data in Long > even > >> the data isn't null. > >> > >> Thanks. > >> > >> > >> Sincerely, > >> > >> DB Tsai > >> -- > >> Web: https://www.dbtsai.c

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-17 Thread Holden Karau
t; > wrote: > >> I've hit this before, where Javadoc for 1.8 is much more strict than 1.7. >> >> I think we should definitely use Java 1.7 for the release if we used it >> for the previous releases in the 2.1 line. We don't want to break java 1.7 >> users in a pat

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
What's the regression this fixed in 2.1 from 2.0? On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan wrote: > IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will > scan all table files only once, and write back the inferred schema to > metastore so that we

Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Holden Karau
Hi Spark Devs, Spark 2.1 has been out since end of December and we've got quite a few fixes merged for 2.1.1

Re: Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Holden Karau
d this brought up in side > conversations before. > > > On Mon, Mar 13, 2017 at 7:07 PM Holden Karau <hol...@pigscanfly.ca> wrote: > > Hi Spark Devs, > > Spark 2.1 has been out since end of December > <http://apache-spark-developers-list.1001551.n3.nabble.com/AN

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
, Apr 24, 2017 at 11:01 AM, Holden Karau <hol...@pigscanfly.ca> wrote: > It > > On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman <mich...@videoamp.com> > wrote: > >> The trouble we ran into is that this upgrade was blocking access to our >> tables, and we d

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
ng 2.1.1 with > INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of up-front > migration notes would help in identifying this new behavior in 2.2. > > Thanks, > > Michael > > > On Apr 24, 2017, at 2:09 AM, Wenchen Fan <wenc...@databricks.com> wrote: > > s

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Holden Karau
+1 (non-binding) I (personally) think that Kubernetes as a scheduler backend should eventually get merged in and there is clearly a community interested in the work required to maintain it. On Tue, Aug 15, 2017 at 9:51 AM William Benton wrote: > +1 (non-binding) > > On Tue,

Re: spark pypy support?

2017-08-14 Thread Holden Karau
As Dong says yes we do test with PyPy in our CI env; but we expect a "newer" version of PyPy (although I don't think we ever bothered to write down what the exact version requirements are for the PyPy support unlike regular Python). On Mon, Aug 14, 2017 at 2:06 PM, Dong Joon Hyun

Re: spark pypy support?

2017-08-14 Thread Holden Karau
a8233, Apr 09 2015, > 02:17:39) > [PyPy 2.5.1 with GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] > > On Mon, Aug 14, 2017 at 2:24 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > > As Dong says yes we do test with PyPy in our CI env; but we expect a > "newer" > &g
