Re: GradientBoostedTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Sean and Joe, I have another question. GradientBoostedTrees.run iterates over the RDD, calling DecisionTree.run on each iteration with a new random sample from the input RDD. DecisionTree.run calls RandomForest.run, which also calls persist. One of these seems superfluous. Should I simply
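
To make the pattern concrete, here is a minimal PySpark sketch of the kind of leak being described (all names and numbers are illustrative, not MLlib's actual internals): each iteration persists a fresh sample, so a missing unpersist() leaves one cached RDD behind per iteration.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-leak-sketch")
data = sc.parallelize(range(100000))

for i in range(10):
    # Each boosting-style iteration draws and persists a new sample.
    sample = data.sample(withReplacement=False, fraction=0.5, seed=i)
    sample.persist(StorageLevel.MEMORY_AND_DISK)
    try:
        sample.count()      # stand-in for fitting the per-iteration model
    finally:
        sample.unpersist()  # omitting this leaks one cached RDD per loop
```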

Re: GradientBoostedTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Hi Joe, Do you want a PR per branch (one for master, one for 1.3)? Are you still maintaining 1.2? Do you need a Jira ticket per PR or can I submit them all under the same ticket? Or should I just submit it to master and let you guys back-port it? Jim

Re: GradientBoostedTrees leaks a persisted RDD

2015-04-23 Thread Sean Owen
Only against master; it can be cherry-picked to other branches. On Thu, Apr 23, 2015 at 10:53 AM, jimfcarroll jimfcarr...@gmail.com wrote: Hi Joe, Do you want a PR per branch (one for master, one for 1.3)? Are you still maintaining 1.2? Do you need a Jira ticket per PR or can I submit them

Re: GradientBoostedTrees leaks a persisted RDD

2015-04-23 Thread Sean Owen
Those are different RDDs that DecisionTree persists, though. It's not redundant. On Thu, Apr 23, 2015 at 11:12 AM, jimfcarroll jimfcarr...@gmail.com wrote: Hi Sean and Joe, I have another question. GradientBoostedTrees.run iterates over the RDD calling DecisionTree.run on each iteration
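
Since the two persists target different RDDs, the redundancy question only arises when caller and callee might persist the same one. A hypothetical guard (this helper is not in Spark; is_cached is PySpark's public flag) shows how a callee could avoid double-persisting while still knowing whether it owns the unpersist:

```python
from pyspark import StorageLevel

def persist_if_needed(rdd, level=StorageLevel.MEMORY_AND_DISK):
    # Persist only if the caller hasn't already; return True when this
    # call now owns the matching unpersist().
    if not rdd.is_cached:
        rdd.persist(level)
        return True
    return False
```

A callee that persisted would then unpersist in a finally block; one that found the RDD already cached would leave it alone.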

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
Ah damn. We need to add it to the Python list. Would you like to give it a shot? On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yep no problem, but I can't seem to find the coalesce function in pyspark.sql.{*, functions, types or whatever :) }
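
For reference, the wrapper did land in a later release; a minimal usage sketch in modern PySpark (SparkSession is post-2.0, so in the 1.3 era this would go through sqlContext):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()
df = spark.createDataFrame([(1, None), (None, 2)], "a INT, b INT")

# coalesce returns the first non-null column per row, mirroring
# Scala's varargs coalesce(e: Column*).
df.select(F.coalesce(df["a"], df["b"]).alias("first_non_null")).show()
```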

Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Reynold Xin
In the ctor of InputSource (I'm also considering adding an explicit initialize call), the implementation of InputSource can execute arbitrary code. The state in it will also be serialized and passed onto the executors. Yes - technically you can hijack getSplits in Hadoop InputFormat to do the

Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Mingyu Kim
Hi Reynold, You mentioned that the new API allows arbitrary code to be run on the driver side, but it's not very clear to me how this is different from what the Hadoop API provides. In your example of using broadcast, did you mean broadcasting something in InputSource.getPartitions() and having
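
The broadcast pattern being contrasted can be sketched in PySpark (the lookup table and names are illustrative): arbitrary driver-side code builds the state, and tasks on the executors dereference it through .value:

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")

# Arbitrary code runs on the driver to build the state...
lookup = sc.broadcast({"a": 1, "b": 2})

# ...and executor tasks read it via .value instead of having it
# re-serialized into every task closure.
result = sc.parallelize(["a", "b", "c"]).map(
    lambda k: lookup.value.get(k, 0)).collect()
```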

Re: GradientBoostedTrees leaks a persisted RDD

2015-04-23 Thread jimfcarroll
Okay. PR: https://github.com/apache/spark/pull/5669 Jira: https://issues.apache.org/jira/browse/SPARK-7100 Hope that helps. Let me know if you need anything else. Jim

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
Yep :) I'll open the JIRA when I've got the time. Thanks. On Thu, Apr 23, 2015 at 19:31, Reynold Xin r...@databricks.com wrote: Ah damn. We need to add it to the Python list. Would you like to give it a shot? On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
What is the way to test/build the pyspark part of Spark? On Thu, Apr 23, 2015 at 22:06, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yep :) I'll open the JIRA when I've got the time. Thanks. On Thu, Apr 23, 2015 at 19:31, Reynold Xin r...@databricks.com wrote: Ah

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
I found another way: setting SPARK_HOME to a released version and launching ipython to load the contexts. I may need your insight, however. I found why it hasn't been done at the same time: this method (like some others) uses varargs in Scala, and for now the way functions are called only one
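
For the varargs point, the countDistinct-style bridge looks roughly like this (a sketch using PySpark's private column helpers, whose exact names and signatures are an assumption for any given version):

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def coalesce(*cols):
    # Scala's coalesce(e: Column*) compiles to a method taking a Seq, so
    # the Python side packs *cols into a Scala Seq of Java Columns and
    # calls through the Py4J gateway.
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.coalesce(_to_seq(sc, cols, _to_java_column))
    return Column(jc)
```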

Re: GradientBoostedTrees leaks a persisted RDD

2015-04-23 Thread Joseph Bradley
I saw the PR already, but only saw this thread just now. I think both persists are useful based on my experience, but it's very hard to say in general. On Thu, Apr 23, 2015 at 12:22 PM, jimfcarroll jimfcarr...@gmail.com wrote: Okay. PR: https://github.com/apache/spark/pull/5669 Jira:

RE: Should we let everyone set Assignee?

2015-04-23 Thread Ulanov, Alexander
My thinking is that the current way of assigning a contributor after the patch is done (or almost done) is OK. Parallel efforts are also OK until they are discussed in the issue's thread. Ilya Ganelin made a good point that it is about moving the project forward. It also adds a means of competition

RE: Should we let everyone set Assignee?

2015-04-23 Thread Sean Owen
The merge script automatically updates the linked JIRA after merging the PR (which is why it's important to put the JIRA number in the title). It can't auto-assign the JIRA since usernames don't match up, but it's an easy reminder to set the Assignee. I do it right after, and I think other committers do too. I'll

Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Sean Owen
Following my comment earlier that I think we set Assignee for Fixed JIRAs consistently, I found there are actually 880 counterexamples. Lots of them are old, and I'll try to fix as many of the recent ones (for the 1.4.0 release credits) as I can stand to click through. Let's set Assignee after

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:47 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: You'd need to add them as a contributor in the JIRA admin page. Once you do that, you should be able to assign the JIRA to that person. Is this documented, and does every PMC member (or committer) have access to do

Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote: Following my comment earlier that I think we set Assignee for Fixed JIRAs consistently, I found there are actually 880 counter examples. Lots of them are old, and I'll try to fix as many that are recent (for the 1.4.0

Re: Spark Streaming updateStateByKey throws OutOfMemoryError

2015-04-23 Thread Sourav Chandra
Hi TD, Some observations: 1. If I submit the application using the spark-submit tool with *client as deploy mode*, it works fine with a single master and worker (driver, master, and worker are running on the same machine). 2. If I submit the application using the spark-submit tool with client as deploy mode, it
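
For context, a minimal updateStateByKey job looks like this (source, port, and checkpoint path are placeholders); the required checkpoint directory and ever-growing per-key state are the usual suspects when memory runs out:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="update-state-sketch")
ssc = StreamingContext(sc, 10)      # 10-second batches
ssc.checkpoint("/tmp/checkpoints")  # updateStateByKey requires this

def update(new_values, state):
    # Running count per key; returning None instead would drop the key
    # and keep the state from growing without bound.
    return (state or 0) + sum(new_values)

counts = (ssc.socketTextStream("localhost", 9999)
          .map(lambda w: (w, 1))
          .updateStateByKey(update))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```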

Contributors, read me! Updated Contributing to Spark wiki

2015-04-23 Thread Sean Owen
Following several discussions about how to improve the contribution process in Spark, I've overhauled the guide to contributing. Anyone who is going to contribute needs to read it, as it has more formal guidance about the process:

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Olivier Girardot
I'll try, thanks. On Fri, Apr 24, 2015 at 00:09, Reynold Xin r...@databricks.com wrote: You can do it similar to the way countDistinct is done, can't you? https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L78 On Thu, Apr 23, 2015 at 1:59 PM, Olivier Girardot

Re: Spark Streaming updateStateByKey throws OutOfMemoryError

2015-04-23 Thread Sourav Chandra
*bump* On Thu, Apr 23, 2015 at 3:46 PM, Sourav Chandra sourav.chan...@livestream.com wrote: Hi TD, Some observations: 1. If I submit the application using the spark-submit tool with *client as deploy mode* it works fine with a single master and worker (driver, master, and worker are running in

Contributing Documentation Changes

2015-04-23 Thread madhu phatak
Hi, As I was reading the Contributing to Spark wiki, I saw that we can contribute external links to Spark tutorials. I have written many of them on my blog (http://blog.madhukaraphatak.com/categories/spark/). It would be great if someone could add them to the Spark website. Regards, Madhukara

First-class support for pip/virtualenv in pyspark

2015-04-23 Thread Justin Uang
Hi, I have been trying to figure out how to ship a Python package that I have been working on, and this has brought up a couple of questions for me. Please note that I'm fairly new to Python package management, so any feedback/corrections are welcome =) It looks like the --py-files support we have
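
As a point of comparison, the runtime analogue of --py-files is sc.addPyFile; a small sketch with a placeholder archive name:

```python
from pyspark import SparkContext

sc = SparkContext(appName="py-deps-sketch")

# Same effect as `spark-submit --py-files deps.zip`: ship an archive of
# pure-Python dependencies so `import mypkg` works inside executor tasks.
sc.addPyFile("deps.zip")  # hypothetical zip/egg built from the package
```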