Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Joseph Bradley
Hi Sean, Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) Short version: That exception should not be thrown, so there is a bug somewhere. The intended logic for handling high-arity categorical features is about the best one can do, as far as I know. Bug finding: For my

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Joseph Bradley
cover. On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley jos...@databricks.com wrote: Hi Sean, Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) Short version: That exception should not be thrown, so there is a bug somewhere. The intended logic for handling high

Re: Issues with AbstractParams

2014-11-04 Thread Joseph Bradley
Hi Deb, Thanks for pointing it out! I don't know of a JIRA for it now, so it would be great if you could open one. I'm looking into the bug... Joseph On Tue, Nov 4, 2014 at 4:42 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I build the master today and I was testing IR statistics on

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-20 Thread Joseph Bradley
Could we move discussion of the design and implementation to the JIRA and/or a work-in-progress PR (tagged with [WIP])? That will help leave a record for the future. Thanks! Joseph On Wed, Nov 19, 2014 at 9:59 PM, Ashutosh ashutosh.triv...@iiitb.org wrote: Done. Thanks. Added you as a

Re: Evaluation Metrics for Spark's MLlib

2014-12-11 Thread Joseph Bradley
Hi, I'd recommend starting by checking out the existing helper functionality for these tasks. There are helper methods to do K-fold cross-validation in MLUtils: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala The experimental spark.ml
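The idea behind K-fold cross-validation mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the MLUtils implementation: the function name `k_fold_indices` and its parameters are hypothetical, and it splits row indices rather than RDDs.

```python
import random


def k_fold_indices(n, k, seed=42):
    """Split range(n) into k (train, validation) index pairs.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so the folds are disjoint and together cover all rows.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(validation)))
    return splits


if __name__ == "__main__":
    for train, validation in k_fold_indices(n=10, k=3):
        print(len(train), len(validation))
```

Each row appears in exactly one validation fold, so every model in the grid is scored on data it was not trained on.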

Re: Welcoming three new committers

2015-02-03 Thread Joseph Bradley
...@gmail.com: Hi all, The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley and Sean Owen. All three have been major contributors to Spark in the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many pieces throughout Spark Core. Join me in welcoming them

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Joseph Bradley
Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance.

Re: K-Means And Class Tags

2015-01-14 Thread Joseph Bradley
However I can do this from Java, and it works in Scala: return words.rdd().retag(Vector.class); Dev On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley jos...@databricks.com wrote: I believe you're running into an erasure issue which we found in DecisionTree too. Check out: https

Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Joseph Bradley
It looks like you're training on the non-scaled data but testing on the scaled data. Have you tried training and testing on only the scaled data? On Thu, Jan 15, 2015 at 10:42 AM, Devl Devel devl.developm...@gmail.com wrote: Thanks, that helps a bit at least with the NaN but the MSE is still

Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-18 Thread Joseph Bradley
Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844 On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng men...@gmail.com wrote: There are three different regParams defined in the grid and there are three folds. For simplicity, we didn't split the dataset into three and reuse them,

Re: K-Means And Class Tags

2015-01-08 Thread Joseph Bradley
I believe you're running into an erasure issue which we found in DecisionTree too. Check out: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134 That retags RDDs which were created from Java to prevent the exception you're running

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Joseph Bradley
+1 Tested on Mac OS X On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng men...@gmail.com wrote: Krishna, I tested your linear regression example. For linear regression, we changed its objective function from 1/n * \|A x - b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least
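The two least-squares objectives quoted above differ only by a constant factor of 2. A small sketch in plain Python makes that concrete; the function names and the toy data are made up for illustration and are not Spark code.

```python
def squared_residual_sum(rows, labels, weights):
    """Return ||A x - b||_2^2 for feature rows A, labels b and weights x."""
    total = 0.0
    for features, label in zip(rows, labels):
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - label) ** 2
    return total


def old_objective(rows, labels, weights):
    # 1/n * ||A x - b||_2^2
    return squared_residual_sum(rows, labels, weights) / len(rows)


def new_objective(rows, labels, weights):
    # 1/(2n) * ||A x - b||_2^2
    return squared_residual_sum(rows, labels, weights) / (2 * len(rows))


if __name__ == "__main__":
    rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    labels = [1.0, 2.0, 3.0]
    weights = [0.5, 0.0]
    print(old_objective(rows, labels, weights))
    print(new_objective(rows, labels, weights))
```

Without regularization both forms have the same minimizer. Once a penalty term is added to the objective, halving the loss term changes its weight relative to the penalty, so the same regularization parameter can give a different fit than before.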

Re: Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Joseph Bradley
The checks against maxCategories are not for statistical purposes; they are to make sure communication does not blow up. There currently are not checks to make sure that there are enough entries for statistically significant results. That is up to the user. I do like the idea of adding a

Re: enum-like types in Spark

2015-03-04 Thread Joseph Bradley
another vote for #4 People are already used to adding () in Java. On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote: #4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html Constants, Values, Variable and Methods Constant

Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread Joseph Bradley
this may have some impact on what the release code looks like. Mike -Original Message--- *From:* Debasish Das [mailto:debasish.da...@gmail.com] *Sent:* Wednesday, February 25, 2015 08:50 AM *To:* 'Joseph Bradley' *Cc:* m...@mbowles.com, 'dev' *Subject:* Re: Have Friedman's glmnet algo

Re: Have Friedman's glmnet algo running in Spark

2015-02-22 Thread Joseph Bradley
Hi Mike, glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs: GLMNET implementation in Spark https://issues.apache.org/jira/browse/SPARK-1673 LinearRegression with L1/L2

Re: Have Friedman's glmnet algo running in Spark

2015-02-24 Thread Joseph Bradley
of columns. Thanks for your help. Mike -Original Message- *From:* Joseph Bradley [mailto:jos...@databricks.com] *Sent:* Sunday, February 22, 2015 06:48 PM *To:* m...@mbowles.com *Cc:* dev@spark.apache.org *Subject:* Re: Have Friedman's glmnet algo running in Spark Hi Mike, glmnet has

Re: Stochastic gradient descent performance

2015-04-02 Thread Joseph Bradley
on this? I do understand that in cluster mode the network speed will kick in and then one can blame it. Best regards, Alexander *From:* Joseph Bradley [mailto:jos...@databricks.com] *Sent:* Thursday, April 02, 2015 10:51 AM *To:* Ulanov, Alexander *Cc:* dev@spark.apache.org *Subject:* Re

Re: GradientBoostTrees leaks a persisted RDD

2015-04-23 Thread Joseph Bradley
I saw the PR already, but only saw this just now. I think both persists are useful based on my experience, but it's very hard to say in general. On Thu, Apr 23, 2015 at 12:22 PM, jimfcarroll jimfcarr...@gmail.com wrote: Okay. PR: https://github.com/apache/spark/pull/5669 Jira:

Re: GradientBoostTrees leaks a persisted RDD

2015-04-22 Thread Joseph Bradley
Hi Jim, You're right; that should be unpersisted. Could you please create a JIRA and submit a patch? Thanks! Joseph On Wed, Apr 22, 2015 at 6:00 PM, jimfcarroll jimfcarr...@gmail.com wrote: Hi all, It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it.

Re: Indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Joseph Bradley
Hi Chunnan, There is currently Scala documentation for the constructor parameters: https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515 There is one benefit to not checking for validity (ordering)

Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to modelFile as a Java object file. This step is loading the model back and reconstructing the KMeansModel, which can then be used to classify new tweets into different clusters. Joseph On Thu, May 7, 2015 at 12:40 PM, anshu shukla

Re: Contribute code to MLlib

2015-05-18 Thread Joseph Bradley
Hi Tarek, Thanks for your interest and for checking the guidelines first! On 2 points: Algorithm: PCA is of course a critical algorithm. The main question is how your algorithm/implementation differs from the current PCA. If it's different and potentially better, I'd recommend opening up a JIRA

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-15 Thread Joseph Bradley
+1 On Wed, Apr 15, 2015 at 5:40 PM, Tom Graves tgraves...@yahoo.com.invalid wrote: +1 tested on spark on yarn on hadoop 2.6 cluster with security. Tom On Sunday, April 5, 2015 6:25 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as

Re: [mllib] Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Joseph Bradley
I'll add a note that this is just for ML, not other parts of Spark. (We can discuss more on the JIRA.) Thanks! Joseph On Mon, Apr 6, 2015 at 9:46 PM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: Hi all, Joseph proposed an idea about using just builder methods, instead of static train()

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Joseph Bradley
+1 tested ML-related items on Mac OS X On Wed, Apr 8, 2015 at 7:59 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 14:16 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0

Re: IndexedRowMatrix semantics

2015-05-20 Thread Joseph Bradley
I believe it works with a mix of DenseVector and SparseVector types. Joseph On Wed, May 20, 2015 at 10:06 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, For indexedrowmatrix and rowmatrix, both take RDD(vector)is it possible that it has intermixed dense and sparse

Re: Contribute code to MLlib

2015-05-20 Thread Joseph Bradley
algorithms and let the user choose? Trevor Trevor Grant Data Scientist *Fortunate is he, who is able to know the causes of things. -Virgil* On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley jos...@databricks.com wrote: Hi Tarek, Thanks for your interest and for checking the guidelines first

Re: MLlib: Anybody working on hierarchical topic models like HLDA?

2015-06-03 Thread Joseph Bradley
on a project in which I use the current LDA implementation that has been contributed by Databricks' Joseph Bradley et al. for the recent 1.3.0 release (thanks guys!). While this is great, my project requires several levels of topics, as I would like to offer users to drill down into subtopics

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-07 Thread Joseph Bradley
+1 On Sat, Jun 6, 2015 at 7:55 PM, Guoqiang Li wi...@qq.com wrote: +1 (non-binding) -- Original -- *From: * Reynold Xin;r...@databricks.com; *Date: * Fri, Jun 5, 2015 03:18 PM *To: * Krishna Sankarksanka...@gmail.com; *Cc: * Patrick

Re: Random Forest driver memory

2015-06-18 Thread Joseph Bradley
Hi Isca, Could you please give more details? Data size, model parameters, stack traces / logs, etc. to help get a better picture? Thanks, Joseph On Wed, Jun 17, 2015 at 9:56 AM, Isca Harmatz pop1...@gmail.com wrote: hello, does anyone has any help on the issue? Isca On Tue, Jun 16,

Re: Contribution

2015-06-14 Thread Joseph Bradley
+1 for checking out the Wiki on Contributing to Spark. It gives helpful pointers about finding starter JIRAs, the discussion and code review process, and how we prioritize algorithms and other contributions. After you read that, I would recommend searching JIRA for issues which catch your interest.

Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2015-06-10 Thread Joseph Bradley
+1 On Sat, Jun 6, 2015 at 9:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a request here - it would be great if people could create JIRA's for any and all merged pull requests. The reason is that when patches get reverted due to build breaks or other issues, it is very

Re: [ml] Why all model classes are final?

2015-06-10 Thread Joseph Bradley
Hi Peter, We've tried to be cautious about making APIs public without need, to allow for changes needed in the future which we can't foresee now. Marking classes as final is part of that. While marking things as Experimental or DeveloperApi is a sort of warning, we've often felt that even

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-30 Thread Joseph Bradley
+1 On Tue, Jun 30, 2015 at 5:27 PM, Reynold Xin r...@databricks.com wrote: +1 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in Spark

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Joseph Bradley
Thanks for bringing this up! I talked with Michael Armbrust, and it sounds like this is from a bug in DataFrame caching: https://issues.apache.org/jira/browse/SPARK-9141 It's marked as a blocker for 1.5. Joseph On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u...@gmail.com wrote: Hey

Re: [ANNOUNCE] Spark branch-1.5

2015-08-03 Thread Joseph Bradley
I agree that it's high time to start changing/removing target versions, especially if component maintainers have a good idea of what is not needed for 1.5. I'll start doing that on ML. On Mon, Aug 3, 2015 at 12:05 PM, Sean Owen so...@cloudera.com wrote: Are these about the right rules of

Re: Make ML Developer APIs public (post-1.4)

2015-08-06 Thread Joseph Bradley
Eron, Thanks for sending out this list! We can make some of the critical ones public for 1.5, but they will be marked DeveloperApi since they may require changes in the future. Just made the JIRA: [ https://issues.apache.org/jira/browse/SPARK-9704] and I'll send a PR soon. Joseph On Mon, Aug

Re: Are These Issues Suitable for our Senior Project?

2015-07-15 Thread Joseph Bradley
Per recent comments on SPARK-6442, I'd recommend not working on that one for now. Instead, even if tasks are not that interesting to you, you should try some small tasks at first to get used to contributing. I am quite sure we'll want to solve SPARK-3703 by May 2016; that's pretty far in the

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Joseph Bradley
That sounds useful; would you mind submitting a JIRA (and a PR if you're willing)? Thanks, Joseph On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier wrote: > Hi, > > MLUtils.loadLibSVMFile verifies that indices are 1-based and > increasing, and otherwise triggers an error.
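The kind of check discussed in this thread (indices 1-based and increasing, with an error message that says what went wrong) can be sketched as follows. This is a plain-Python illustration, not the MLUtils code; `parse_libsvm_line` is a hypothetical name, and treating "increasing" as strictly increasing is an assumption.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' and validate the feature indices.

    Hypothetical helper: raises ValueError naming the offending index
    and the input line instead of a generic failure.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])
    indices, values = [], []
    previous = 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index < 1:
            raise ValueError(f"index {index} is not 1-based in line {line!r}")
        if index <= previous:
            raise ValueError(
                f"indices must be increasing, found {index} after "
                f"{previous} in line {line!r}"
            )
        indices.append(index)
        values.append(float(value_text))
        previous = index
    return label, indices, values


if __name__ == "__main__":
    print(parse_libsvm_line("1 1:0.5 3:2.0"))
```

Including the offending index and the raw line in the message lets a user locate the bad record in a large input file.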

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based

Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio, Apart from apologies about limited review bandwidth (from me too!), I wanted to add: It would be interesting to hear what feedback you've gotten from users of your package. Perhaps you could collect feedback by (a) emailing the user list and (b) adding a note in the Spark Packages

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Joseph Bradley
+1 tested on OS X On Sat, Nov 7, 2015 at 10:25 AM, Reynold Xin wrote: > +1 myself too > > On Sat, Nov 7, 2015 at 12:01 AM, Robin East > wrote: > >> +1 >> Mac OS X 10.10.5 Yosemite >> >> mvn clean package -DskipTests (13min) >> >> Basic graph tests

Re: Gradient Descent with large model size

2015-10-15 Thread Joseph Bradley
For those numbers of partitions, I don't think you'll actually use tree aggregation. The number of partitions needs to be over a certain threshold (>= 7) before treeAggregate really operates on a tree structure:

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu, The spark.ml classes are part of the higher-level "Pipelines" API, which works with DataFrames. When creating this API, we decided to separate it from the old API to avoid confusion. You can read more about it here: http://spark.apache.org/docs/latest/ml-guide.html For (3): We

Re: Are These Issues Suitable for our Senior Project?

2015-07-09 Thread Joseph Bradley
It would be great to get more contributions! If you're new to contributing, it will be good to start with some small contributions and check out: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark But if those build up to a larger contribution, the top ones I'd pick out are:

Re: [MLlib] Extensibility of MLlib classes (Word2VecModel etc.)

2015-09-14 Thread Joseph Bradley
We tend to resist opening up APIs unless there's a strong reason to and we feel reasonably confident that the API will remain stable. That allows us to make fixes if we realize there are issues with those APIs. But if you have an important use case, I'd recommend opening up a JIRA to discuss it.

Re: Enum parameter in ML

2015-09-16 Thread Joseph Bradley
I've tended to use Strings. Params can be created with a validator (isValid) which can ensure users get an immediate error if they try to pass an unsupported String. Not as nice as compile-time errors, but easier on the APIs. On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang
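The string-plus-validator approach described above can be illustrated outside Spark. The class below is a hypothetical plain-Python sketch (`ValidatedStringParam` is not a Spark class): it rejects unsupported strings at set time and exposes the allowed values so a user can list them.

```python
class ValidatedStringParam:
    """Hypothetical string parameter restricted to a fixed set of values."""

    def __init__(self, name, allowed, default):
        self.name = name
        self.allowed = tuple(allowed)  # readable by users who want the options
        self._value = None
        self.set(default)

    def is_valid(self, value):
        return value in self.allowed

    def set(self, value):
        # Fail immediately, at the call site, rather than later during fit.
        if not self.is_valid(value):
            raise ValueError(
                f"{self.name} got unsupported value {value!r}; "
                f"expected one of {list(self.allowed)}"
            )
        self._value = value
        return self

    @property
    def value(self):
        return self._value


if __name__ == "__main__":
    solver = ValidatedStringParam("solver", ["auto", "l-bfgs", "normal"], "auto")
    print(solver.set("l-bfgs").value)
    print(solver.allowed)
```

The trade-off is the one stated in the message: errors surface at runtime rather than at compile time, but the public API stays a plain string.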

Re: Enum parameter in ML

2015-09-16 Thread Joseph Bradley
>> >> >> Strings sounds reasonable. However, there is no StringParam (only >> StringArrayParam). Should I create a new param type? Also, how can the user >> get all possible values of String parameter? >> >> >> >> Best regards, Alexander >> >>

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Joseph Bradley
+1 Tested MLlib on Mac OS X On Thu, Sep 24, 2015 at 6:14 PM, Reynold Xin wrote: > Krishna, > > Thanks for testing every release! > > > On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar > wrote: > >> +1 (non-binding, of course) >> >> 1. Compiled OSX

Re: Problem in running MLlib SVM

2015-12-01 Thread Joseph Bradley
around 57% on training set. > > On Mon, Nov 30, 2015 at 6:33 PM, Joseph Bradley <jos...@databricks.com> > wrote: > >> model.predict should return a 0/1 predicted label. The example code is >> misleading when it calls the prediction a "score." >> >> On M

Re: [ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Joseph Bradley
Thanks for reporting this! I just added a JIRA: https://issues.apache.org/jira/browse/SPARK-12159 That would be great if you could send a PR for it; thanks! Joseph On Sat, Dec 5, 2015 at 5:02 AM, Benjamin Fradet wrote: > Hi, > > I was wondering why the IndexToString

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
pache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1 >>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote: >>> >>>> Hi Joseph, >>>> >>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'

Re: Python API for Association Rules

2015-12-02 Thread Joseph Bradley
If you're working on a feature, please comment on the JIRA first (to avoid conflicts / duplicate work). Could you please copy what you wrote to the JIRA to discuss there? Thanks, Joseph On Wed, Dec 2, 2015 at 4:51 AM, caiquermarques95 wrote: > Hello everyone! >

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM,

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Joseph Bradley
+1 On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin wrote: > +1 > > > On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra > wrote: > >> +1 >> >> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust > > wrote: >> >>> Please vote on

Re: BIRCH clustering algorithm

2015-12-15 Thread Joseph Bradley
Hi Dzeno, I'm not familiar with the algorithm myself, but if you have an important use case for it, you could open a JIRA to discuss it. However, if it is a less common algorithm, I'd recommend first submitting it as a Spark package (but publicizing the package on the user list). If it gains

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene, The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap. There is not
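To see why integer node IDs tied to tree positions cap the depth, here is a small sketch. It assumes a heap-style numbering (root is 1, children of node i are 2i and 2i + 1) and 32-bit signed IDs; the message above only says that IDs correspond to positions in the binary tree, so the exact scheme here is an assumption.

```python
INT_MAX = 2**31 - 1  # largest 32-bit signed integer


def left_child(node_id):
    return 2 * node_id


def right_child(node_id):
    return 2 * node_id + 1


def depth_of(node_id):
    # Root (id 1) is at depth 0; each level doubles the id range.
    return node_id.bit_length() - 1


def max_node_id_at_depth(depth):
    # Ids at a given depth run from 2**depth to 2**(depth + 1) - 1.
    return 2 ** (depth + 1) - 1


def max_depth_for_int_ids():
    depth = 0
    while max_node_id_at_depth(depth + 1) <= INT_MAX:
        depth += 1
    return depth


if __name__ == "__main__":
    print(max_depth_for_int_ids())
    print(max_node_id_at_depth(30) == INT_MAX)
```

Under this assumed numbering the deepest level whose IDs still fit in a signed 32-bit integer is 30, since the ID range doubles with every level regardless of how many nodes the tree actually contains.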

Re: running lda in spark throws exception

2015-12-29 Thread Joseph Bradley
Hi Li, I'm wondering if you're running into the same bug reported here: https://issues.apache.org/jira/browse/SPARK-12488 I haven't figured out yet what is causing it. Do you have a small corpus which reproduces this error, and which you can share on the JIRA? If so, that would help a lot in

Re: Problem in running MLlib SVM

2015-11-30 Thread Joseph Bradley
model.predict should return a 0/1 predicted label. The example code is misleading when it calls the prediction a "score." On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem wrote: > You should never use the training data to measure your prediction > accuracy. Always use a fresh

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when

Re: Unhandled case in VectorAssembler

2015-11-20 Thread Joseph Bradley
Yes, please, could you send a JIRA (and PR)? A custom error message would be better. Thank you! Joseph On Fri, Nov 20, 2015 at 2:39 PM, BenFradet wrote: > Hey there, > > I noticed that there is an unhandled case in the transform method of > VectorAssembler if one of

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > > I have a random

Re: Welcoming Yanbo Liang as a committer

2016-06-12 Thread Joseph Bradley
Congrats & welcome! On Tue, Jun 7, 2016 at 7:15 AM, Xiangrui Meng wrote: > Congrats!! > > On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali > wrote: > >> Congratulations Yanbo Liang! Well deserved. >> >> >> On Sun, Jun 5, 2016 at 7:10 PM,

Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
One more note: When you specify the stages in the Pipeline, they need to be in topological order according to the DAG. On Sun, Jun 12, 2016 at 10:47 AM, Joseph Bradley <jos...@databricks.com> wrote: > Hi Pranay, > > Yes, you can do this. The DAG structure should be specified vi

Re: Shrinking the DataFrame lineage

2016-06-12 Thread Joseph Bradley
ed problem handled in GraphFrames? Suppose, I want to >> use aggregateMessages in the iterative loop, for implementing PageRank. >> >> >> >> Best regards, Alexander >> >> >> >> *From:* Joseph Bradley [mailto:jos...@databricks.com] >> *Sent:* Fr

Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
Hi Pranay, Yes, you can do this. The DAG structure should be specified via the various Transformers' input and output columns, where a Transformer can have multiple input and/or output columns. Most of the classification and regression Models are good examples of Transformers with multiple
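The DAG implied by input and output columns, and a stage order that respects it, can be sketched in plain Python with the standard-library `graphlib` module (Python 3.9+). The stage names and column names below are made up, and `order_stages` is a hypothetical helper, not part of the Pipeline API.

```python
from graphlib import TopologicalSorter


def order_stages(stages):
    """Return stage names in an order where producers precede consumers.

    stages maps a stage name to (input_columns, output_columns).
    Columns produced by no stage are assumed to exist in the input data.
    """
    producer = {}
    for name, (_, outputs) in stages.items():
        for column in outputs:
            producer[column] = name
    sorter = TopologicalSorter()
    for name, (inputs, _) in stages.items():
        dependencies = {producer[c] for c in inputs if c in producer}
        sorter.add(name, *dependencies)
    return list(sorter.static_order())


if __name__ == "__main__":
    stages = {
        "classifier": (["features"], ["prediction"]),
        "assembler": (["term_counts", "age"], ["features"]),
        "counter": (["text"], ["term_counts"]),
    }
    print(order_stages(stages))
```

A list produced this way is one valid order in which to pass the stages to a pipeline; `static_order` raises `graphlib.CycleError` if the column dependencies form a cycle.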

Re: Implementing linear albegra operations in the distributed linalg package

2016-06-10 Thread Joseph Bradley
I agree that more distributed matrix ops would be good to have, but I think there are a few things which need to happen first: * Now that the spark.ml package has local linear algebra separate from the spark.mllib package, we should migrate the distributed linear algebra implementations over to

Re: Hello

2016-06-20 Thread Joseph Bradley
Hi Harmeet, I'll add one more item to the other advice: The community is in the process of putting together a roadmap JIRA for 2.1 for ML: https://issues.apache.org/jira/browse/SPARK-15581 This JIRA lists some of the major items and links to a few umbrella JIRAs with subtasks. I'd expect this

Re: Welcoming two new committers

2016-02-08 Thread Joseph Bradley
Congrats & welcome! On Mon, Feb 8, 2016 at 12:19 PM, Ram Sriharsha wrote: > great job guys! congrats and welcome! > > On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan wrote: > >> Welcome. >> >> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati < >>

Re: Adding Naive Bayes sample code in Documentation

2016-01-29 Thread Joseph Bradley
JIRA created! https://issues.apache.org/jira/browse/SPARK-13089 Feel free to pick it up if you're interested. : ) Joseph On Wed, Jan 27, 2016 at 8:43 AM, Vinayak Agrawal wrote: > Hi, > I was reading through Spark ML package and I couldn't find Naive Bayes >

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi, This is more a question for the user list, not the dev list, so I'll CC user. If you're using mllib.clustering.LDAModel (RDD API), then can you make sure you're using a LocalLDAModel (or convert to it from DistributedLDAModel)? You can then call topicDistributions() on the new data. If

Re: pull request template

2016-03-15 Thread Joseph Bradley
+1 for keeping the template I figure any template will require conscientiousness & enforcement. On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen wrote: > The template is a great thing as it gets instructions even more right > in front of people. > > Another idea is to just write

Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-12 Thread Joseph Bradley
That sounds useful. Would you mind creating a JIRA for it? Thanks! Joseph On Mon, Apr 11, 2016 at 2:06 AM, Rahul Tanwani wrote: > Hi, > > Currently the RandomForest algo takes a single maxBins value to decide the > number of splits to take. This sometimes causes

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1 By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591 On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia wrote: > This sounds good to me as well. The one thing we should pay attention to > is how we update the docs so

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
e that in most cases I simply won't hit it, but the depth > of the tree would be much more, than 30. > > > -- > Be well! > Jean Morozov > > On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com> > wrote: > >> Hi Eugene, >

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users, I want to bring attention to a proposal to merge the MLlib (spark.ml) concepts of Estimator and Model in Spark 2.0. Please comment & discuss on SPARK-14033 (not in this email thread). *TL;DR:* *Proposal*: Merge Estimator

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-25 Thread Joseph Bradley
There have been some comments about using Pipelines outside of ML, but I have not yet seen a real need for it. If a user does want to use Pipelines for non-ML tasks, they still can use Transformers + PipelineModels. Will that work? On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-29 Thread Joseph Bradley
RDD/DataFrame space. >>> > >>> > So, to promote a more extensive use of Pipelines, PipelineStages, and >>> > Transformers, I was thinking about moving that part to SQL/DataFrame >>> > API where they really belong. If not, I think people might miss the

Re: running lda in spark throws exception

2016-04-04 Thread Joseph Bradley
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)

Re: net.razorvine.pickle.PickleException in Pyspark

2016-04-25 Thread Joseph Bradley
Thanks for your work on this. Can we continue discussing on the JIRA? On Sun, Apr 24, 2016 at 9:39 AM, Caique Marques wrote: > Hello, everyone! > > I'm trying to implement the association rules in Python. I got implement > an association by a frequent element, works

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-27 Thread Joseph Bradley
Do you have code which can reproduce this performance drop in treeReduce? It would be helpful to debug. In the 1.6 release, we profiled it via the various MLlib algorithms and did not see performance drops. It's not just renumbering the partitions; it is reducing the number of partitions by a

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Joseph Bradley
+1 On Wed, May 18, 2016 at 10:49 AM, Reynold Xin wrote: > Hi Ovidiu-Cristian , > > The best source of truth is change the filter with target version to > 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we > get closer to 2.0 release, more will be

Re: Shrinking the DataFrame lineage

2016-05-13 Thread Joseph Bradley
Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346 I don't have a great method currently, but hacks can get around it: convert the DataFrame to an RDD and back to truncate the query plan lineage. Joseph On Wed, May 11, 2016 at 12:46 PM, Ulanov, Alexander <

Re: Organizing Spark ML example packages

2016-04-20 Thread Joseph Bradley
Sounds good to me. I'd request we be strict during this process about requiring *no* changes to the example itself, which will make review easier. On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler wrote: > +1, adding some organization would make it easier for people to find a >

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Joseph Bradley
+1 Mainly tested ML/Graph/R. Perf tests from Tim Hunter showed minor speedups from 1.6 for common ML algorithms. On Thu, Jul 21, 2016 at 9:41 AM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > +1 (non binding) > > Tested PySpark Core, DataFrame/SQL, MLlib and Streaming on a

Re: Welcoming Felix Cheung as a committer

2016-08-16 Thread Joseph Bradley
Welcome Felix! On Mon, Aug 15, 2016 at 6:16 AM, mayur bhole wrote: > Congrats Felix! > > On Mon, Aug 15, 2016 at 2:57 PM, Paul Roy wrote: > >> Congrats Felix >> >> Paul Roy. >> >> On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia

PSA: Java 8 unidoc build

2017-02-06 Thread Joseph Bradley
and others who have made many fixes for this! See these sample PRs for some issues causing failures (especially around links): https://github.com/apache/spark/pull/16741 https://github.com/apache/spark/pull/16604 Thanks, Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc

Re: Feedback on MLlib roadmap process proposal

2017-01-23 Thread Joseph Bradley
M > Subject: Re: Feedback on MLlib roadmap process proposal > To: Seth Hendrickson <seth.hendrickso...@gmail.com> > Cc: Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org> > > > > +1 general abstractions like distributed linear algebra. > > On Thu, Ja

MLlib mission and goals

2017-01-23 Thread Joseph Bradley
bilities, and it will be great to hear the community's thoughts! Thanks, Joseph

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Joseph Bradley
> mber of areas in Spark, including linear algebra, stats/maths functions in DataFrames, Python/R APIs for DataFrames, dstream, and most recently Structured Streaming.
>
> Holden has been a long time Spark contributor and evangelist. She has written a few books on Spark, as well as frequent contributions to the Python API to improve its usability and performance.
>
> Please join me in welcoming the two!
>
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
-- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Feedback on MLlib roadmap process proposal

2017-01-26 Thread Joseph Bradley
> work they believe needs doing, and shepherd work initiated by others (a clear bug report, a PR) to a resolution. Things get done by doing them, or by building influence by doing other things the project needs doing. It isn't a mechanical, objective process, and can't be. But it does work in a recognizable way.
-- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: MLlib mission and goals

2017-01-24 Thread Joseph Bradley
ory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers> > > -- > View this message in context: Re: MLlib mission and goals > <http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-mission-and-goals-tp20715p20754.html>

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-15 Thread Joseph Bradley
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's been active in the SQL area and writes very small, surgical patches that are high quality. Please join me in congratulating Takuya-san!
>
> --
> Takuya UESHIN
> Tokyo, Japan
> http://twitter.com/ueshin
-- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Joseph Bradley
implementation is visible and for lower level integration, > > What I tend to do is keep my own code in its package and try to do as thin a bridge over to it from the [private] scope. It's also important to > name things obviously, say, org.apache.spark.microsoft, so stack trace

Re: Spark Improvement Proposals

2017-02-24 Thread Joseph Bradley
> with multiple Committers and active users. I heard many fantastic ideas. I believe Spark improvement proposals are good channels to collect the requirements/designs.
>
> IMO, we also need to consider the priority when working on these items. Even if the proposal is accepted, it does not mean it will be implemented and merged immediately. It is not a FIFO queue.
>
> Even if some PRs are merged, sometimes, we still have to revert them back, if the design and implementation are not reviewed carefully. We have to ensure our quality. Spark is not an application software. It is an infrastructure software that is being used by many many companies. We have to be very careful in the design and implementation, especially adding/changing the external APIs.
>
> When I developed the Mainframe infrastructure/middleware software in the past 6 years, I was involved in the discussions with external/internal customers. The to-do feature list was always above 100. Sometimes, the customers are feeling frustrated when we are unable to deliver them on time due to the resource limits and others. Even if they paid us billions, we still need to do it phase by phase or sometimes they have to accept the workarounds. That is the reality everyone has to face, I think.
>
> Thanks,
> Xiao Li
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> IT Architect / Lead Consultant
> Greater Chicago
-- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Feedback on MLlib roadmap process proposal

2017-01-17 Thread Joseph Bradley
munication. * This is fairly orthogonal to the SIP discussion since this proposal is more about setting release targets than about proposing future plans. Thanks! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Joseph Bradley
+1 On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote: > +1 (non-binding) > On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote: > >> +1 >> >> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> +1 >>> >>> On Sun,

Re: welcoming Xiao Li as a committer

2016-10-05 Thread Joseph Bradley
Congrats! On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta wrote: > Congratulations Xiao! > > - Kousuke > On 2016/10/05 7:44, Bryan Cutler wrote: > > Congrats Xiao! > > On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau > wrote: > >> Congratulations :D

Re: [ML]Random Forest Error : Size exceeds Integer.MAX_VALUE

2016-10-05 Thread Joseph Bradley
Could you please file a bug report JIRA and also include more info about what you ran? * Random forest Param settings * dataset dimensionality, partitions, etc. Thanks! On Tue, Oct 4, 2016 at 10:44 PM, Samkit Shah wrote: > Hello folks, > I am running Random Forest from ml
