Re: Feedback: Feature request

2015-08-28 Thread Manish Amde
Sounds good. It's a request I have seen a few times in the past and have
needed it personally. Maybe Joseph Bradley has something to add.

I think a JIRA to capture this would be great. We can then move this
discussion to the JIRA.

On Friday, August 28, 2015, Cody Koeninger c...@koeninger.org wrote:

 I wrote some code for this a while back; pretty sure it didn't need access
 to anything private in the decision tree / random forest model.  If people
 want it added to the API I can put together a PR.

 I think it's important to have separately parseable operators / operands
 though.  E.g.:

 lhs:0,op:<=,rhs:-35.0
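 A minimal sketch of that encoding in Python (the lhs/op/rhs field names mirror the example above; nothing here is an existing MLlib API):

```python
import json

# Sketch of the separately parseable rule encoding suggested above. The field
# names lhs/op/rhs mirror the example but are hypothetical, not an MLlib API.
def split_rule(feature_index, operator, threshold):
    return {"lhs": feature_index, "op": operator, "rhs": threshold}

rule = split_rule(0, "<=", -35.0)
print(json.dumps(rule))  # {"lhs": 0, "op": "<=", "rhs": -35.0}
```

 Keeping the operator as its own field means downstream consumers (D3, rule engines) never have to parse it out of a display string.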
 On Aug 28, 2015 12:03 AM, Manish Amde manish...@gmail.com wrote:

 Hi James,

 It's a good idea. A JSON format is more convenient for visualization,
 though a little inconvenient to read. How about a toJson() method? It might
 make the MLlib API inconsistent across models, though.

 You should probably create a JIRA for this.

 CC: dev list

 -Manish

 On Aug 26, 2015, at 11:29 AM, Murphy, James james.mur...@disney.com wrote:

 Hey all,



 In working with the DecisionTree classifier, I found it difficult to
 extract rules that could easily facilitate visualization with libraries
 like D3.



 So for example, using print(model.toDebugString()), I get the following
 result:



If (feature 0 <= -35.0)
   If (feature 24 <= 176.0)
 Predict: 2.1
   If (feature 24 <= 176.0)
 Predict: 4.2
   Else (feature 24 > 176.0)
 Predict: 6.3
Else (feature 0 > -35.0)
   If (feature 24 <= 11.0)
 Predict: 4.5
   Else (feature 24 > 11.0)
 Predict: 10.2



 But ideally, I could see results in a more parseable format like JSON:



 {
   "node": [
     {
       "name": "node1",
       "rule": "feature 0 <= -35.0",
       "children": [
         {
           "name": "node2",
           "rule": "feature 24 <= 176.0",
           "children": [
             { "name": "node4", "rule": "feature 20 > 116.0", "predict": 2.1 },
             { "name": "node5", "rule": "feature 20 <= 116.0", "predict": 4.2 },
             { "name": "node5", "rule": "feature 20 > 116.0", "predict": 6.3 }
           ]
         },
         {
           "name": "node3",
           "rule": "feature 0 > -35.0",
           "children": [
             { "name": "node7", "rule": "feature 3 <= 11.0", "predict": 4.5 },
             { "name": "node8", "rule": "feature 3 > 11.0", "predict": 10.2 }
           ]
         }
       ]
     }
   ]
 }
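 A nested format like this can be produced by a small recursive helper. The sketch below uses a hypothetical Node class and to_dict function, not MLlib's internal tree representation:

```python
import json

# Sketch of the kind of toJson() helper being discussed. Node and to_dict are
# hypothetical stand-ins for MLlib's internal decision tree representation.
class Node:
    def __init__(self, name, rule=None, predict=None, children=None):
        self.name, self.rule, self.predict = name, rule, predict
        self.children = children or []

def to_dict(node):
    d = {"name": node.name}
    if node.rule is not None:
        d["rule"] = node.rule
    if node.children:
        d["children"] = [to_dict(c) for c in node.children]
    else:
        d["predict"] = node.predict   # leaves carry a prediction
    return d

tree = Node("node1", rule="feature 0 <= -35.0", children=[
    Node("node2", rule="feature 24 <= 176.0", children=[
        Node("node4", rule="feature 20 > 116.0", predict=2.1),
    ]),
    Node("node3", rule="feature 0 > -35.0", children=[
        Node("node7", rule="feature 3 <= 11.0", predict=4.5),
    ]),
])
print(json.dumps(to_dict(tree), indent=2))
```

 The resulting dict serializes directly with json.dumps and can be fed to D3's hierarchy layouts unchanged.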



 Food for thought!



 Thanks,



 Jim






Re: Feedback: Feature request

2015-08-27 Thread Manish Amde
Hi James,

It's a good idea. A JSON format is more convenient for visualization, though a
little inconvenient to read. How about a toJson() method? It might make the
MLlib API inconsistent across models, though.

You should probably create a JIRA for this.

CC: dev list

-Manish

 On Aug 26, 2015, at 11:29 AM, Murphy, James james.mur...@disney.com wrote:
 
 Hey all,
  
 In working with the DecisionTree classifier, I found it difficult to extract 
 rules that could easily facilitate visualization with libraries like D3.
  
 So for example, using print(model.toDebugString()), I get the following
 result:
  
If (feature 0 <= -35.0)
   If (feature 24 <= 176.0)
 Predict: 2.1
   If (feature 24 <= 176.0)
 Predict: 4.2
   Else (feature 24 > 176.0)
 Predict: 6.3
 Else (feature 0 > -35.0)
   If (feature 24 <= 11.0)
 Predict: 4.5
   Else (feature 24 > 11.0)
 Predict: 10.2
  
 But ideally, I could see results in a more parseable format like JSON:
  
 {
   "node": [
     {
       "name": "node1",
       "rule": "feature 0 <= -35.0",
       "children": [
         {
           "name": "node2",
           "rule": "feature 24 <= 176.0",
           "children": [
             { "name": "node4", "rule": "feature 20 > 116.0", "predict": 2.1 },
             { "name": "node5", "rule": "feature 20 <= 116.0", "predict": 4.2 },
             { "name": "node5", "rule": "feature 20 > 116.0", "predict": 6.3 }
           ]
         },
         {
           "name": "node3",
           "rule": "feature 0 > -35.0",
           "children": [
             { "name": "node7", "rule": "feature 3 <= 11.0", "predict": 4.5 },
             { "name": "node8", "rule": "feature 3 > 11.0", "predict": 10.2 }
           ]
         }
       ]
     }
   ]
 }
  
 Food for thought!
  
 Thanks,
  
 Jim
  


Re: DecisionTree Algorithm used in Spark MLLib

2015-01-01 Thread Manish Amde
Hi Anoop,

The Spark decision tree implementation supports regression and multi-class
classification, continuous and categorical features, and pruning; it does not
support missing features at present. You can probably think of it as
distributed CART, though personally I always find the acronyms confusing.

How much difference are you seeing? There is a very small difference in how
the candidate split thresholds are calculated across libraries (there is no
single right way), but it should not lead to a significant difference in
performance.
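For illustration, one common scheme is to place candidate thresholds at approximate quantiles of the observed feature values; other libraries use midpoints between sorted distinct values instead, which is one source of the small differences above. This helper is a hypothetical sketch, not Spark's actual code:

```python
# Sketch of quantile-based candidate split thresholds for one continuous
# feature. candidate_thresholds is illustrative, not Spark's implementation.
def candidate_thresholds(values, num_bins):
    ordered = sorted(values)
    step = len(ordered) / float(num_bins)
    # one candidate per quantile boundary (num_bins - 1 of them)
    return [ordered[int(step * i)] for i in range(1, num_bins)]

vals = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0]
print(candidate_thresholds(vals, 4))  # [3.0, 6.0, 8.0]
```

Two libraries that pick slightly different thresholds from the same data will grow slightly different trees, which is why small discrepancies between R and Spark are expected.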

-Manish


On Monday, December 29, 2014, Anoop Shiralige anoop.shiral...@gmail.com
wrote:

 Hi All,

 I am trying to do a comparison, by building the model locally using R and
 on cluster using spark.
 There is some difference in the results.

 Any idea what the internal implementation of the decision tree in Spark
 MLlib is (ID3, C4.5, C5.0, or the CART algorithm)?

 Thanks,
 AnoopShiralige



Re: Print Node info. of Decision Tree

2014-12-08 Thread Manish Amde
Hi Jake,

The toString method should print the full model in versions 1.1.x.

The current master branch has a toDebugString method for
DecisionTreeModel which should print out all the nodes, and the
toString method has been updated to print only a summary, so there is a
slight change in the upcoming 1.2.x release.
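As a sketch of what such a traversal looks like, the snippet below does a pre-order walk over a hypothetical (rule, prediction, left, right) tuple, not MLlib's internal Node class:

```python
# Sketch of the kind of pre-order traversal a toDebugString-style printer
# performs. The (rule, prediction, left, right) tuples are a hypothetical
# stand-in for MLlib's internal Node class.
def debug_string(node, depth=0):
    rule, predict, left, right = node
    pad = "  " * depth
    if left is None and right is None:      # leaf: just the prediction
        return "%sPredict: %s\n" % (pad, predict)
    out = "%sIf (%s)\n" % (pad, rule) + debug_string(left, depth + 1)
    out += "%sElse\n" % pad + debug_string(right, depth + 1)
    return out

leaf = lambda p: (None, p, None, None)
tree = ("feature 0 <= -35.0", None, leaf(2.1), leaf(10.2))
print(debug_string(tree))
```

The same walk works for navigating the nodes programmatically instead of printing them.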

-Manish

On Sun, Dec 7, 2014 at 9:17 PM, jake Lim itwiza...@gmail.com wrote:

 How can I print the node info of a decision tree model?
 I want to navigate and print all the information in the decision tree model.
 Is there some kind of function/method to support this?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Print-Node-info-of-Decision-Tree-tp20572.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Manish Amde
Hi Charles,

I am not aware of other storage formats. Perhaps Sean or Sandy can
elaborate more given their experience with Oryx.

There is work by Smola et al at Google that talks about large scale model
update and deployment.
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu

-Manish

On Sunday, November 16, 2014, Charles Earl charles.ce...@gmail.com wrote:

 Manish and others,
 A follow up question on my mind is whether there are protobuf (or other
 binary format) frameworks in the vein of PMML. Perhaps scientific data
 storage frameworks like netcdf, root are possible also.
 I like the comprehensiveness of PMML but as you mention the complexity of
 management for large models is a concern.
 Cheers

 On Fri, Nov 14, 2014 at 1:35 AM, Manish Amde manish...@gmail.com wrote:

 @Aris, we are closely following the PMML work that is going on and as
 Xiangrui mentioned, it might be easier to migrate models such as logistic
 regression and then migrate trees. Some of the models get fairly large (as
 pointed out by Sung Chung) with deep trees as building blocks and we might
 have to consider a distributed storage and prediction strategy.


 On Tuesday, November 11, 2014, Xiangrui Meng men...@gmail.com wrote:

 Vincenzo sent a PR and included k-means as an example. Sean is helping
 review it. PMML standard is quite large. So we may start with simple
 model export, like linear methods, then move forward to tree-based.
 -Xiangrui

 On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
  Hello Spark and MLLib folks,
 
  So a common problem in the real world of using machine learning is
 that some data analysts use tools like R, but the data engineers out
 there will use more advanced systems like Spark MLlib or even Python's
 scikit-learn.
 
  In the real world, I want to have a system where multiple different
  modeling environments can learn from data / build models, represent the
  models in a common language, and then have a layer which just takes the
  model and run model.predict() all day long -- scores the models in
 other
  words.
 
  It looks like the project openscoring.io and jpmml-evaluator are some
  amazing systems for this, but they fundamentally use PMML as the model
  representation here.
 
  I have read some JIRA tickets indicating that Xiangrui Meng is interested
  in getting PMML export implemented for MLlib models; is that happening?
  Further, would something like Manish Amde's boosted ensemble tree methods
  be representable in PMML?
 
  Thank you!!
  Aris





 --
 - Charles



Re: Status of MLLib exporting models to PMML

2014-11-13 Thread Manish Amde
@Aris, we are closely following the PMML work that is going on and as
Xiangrui mentioned, it might be easier to migrate models such as logistic
regression and then migrate trees. Some of the models get fairly large (as
pointed out by Sung Chung) with deep trees as building blocks and we might
have to consider a distributed storage and prediction strategy.


On Tuesday, November 11, 2014, Xiangrui Meng men...@gmail.com wrote:

 Vincenzo sent a PR and included k-means as an example. Sean is helping
 review it. PMML standard is quite large. So we may start with simple
 model export, like linear methods, then move forward to tree-based.
 -Xiangrui

 On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
  Hello Spark and MLLib folks,
 
  So a common problem in the real world of using machine learning is that
 some data analysts use tools like R, but the data engineers out there
 will use more advanced systems like Spark MLlib or even Python's
 scikit-learn.
 
  In the real world, I want to have a system where multiple different
  modeling environments can learn from data / build models, represent the
  models in a common language, and then have a layer which just takes the
  model and run model.predict() all day long -- scores the models in other
  words.
 
  It looks like the project openscoring.io and jpmml-evaluator are some
  amazing systems for this, but they fundamentally use PMML as the model
  representation here.
 
  I have read some JIRA tickets that Xiangrui Meng is interested in getting
  PMML implemented to export MLLib models, is that happening? Further,
 would
  something like Manish Amde's boosted ensemble tree methods be
 representable
  in PMML?
 
  Thank you!!
  Aris





Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-18 Thread Manish Amde
Hi Aris,


Thanks for the interest. First and foremost, tree ensembles are a top priority
for the 1.2 release and we are working hard towards it. A random forests PR is
already under review, and AdaBoost and gradient boosting will be added soon
after.

Unfortunately, the GBDT branch I shared is way off master. There have been a
lot of under-the-hood optimizations for decision trees, and I am not surprised
that the branch doesn't compile. It would be best if you could wait a few days
until I make the branch compatible with the latest master.

Again, thanks for your interest in boosting algos. We are eager to add them to
MLlib ASAP.

On Thu, Sep 18, 2014 at 7:27 PM, Aris arisofala...@gmail.com wrote:

 Thank you, Spark community - you make life much more lovely; suffering in
 silence is not fun!
 I am trying to build the Spark Git branch from Manish Amde, available here:
 https://github.com/manishamde/spark/tree/ada_boost
 I am trying to build the non-master branch 'ada_boost' (at the link above),
 but './sbt/sbt assembly' does not work, as it hits all kinds of new code
 that doesn't build. I saw another script at the top level called
 'make-distribution.sh', which requires Maven and specifically Java 6 (it
 does not allow javac version 7), but that also fails.
 Does anybody have any pointers for building this developmental build of
 Spark with support for adaptive boosting (adaboost ensemble decision tree
 method) in MLlib?
 Thanks!

Re: Gradient Boosted Machines

2014-08-05 Thread Manish Amde
Hi Daniel,

Thanks a lot for your interest. Gradient boosting and AdaBoost algorithms
are under active development and should be a part of release 1.2.

-Manish


On Mon, Jul 14, 2014 at 11:24 AM, Daniel Bendavid 
daniel.benda...@creditkarma.com wrote:

  Hi,

  My company is strongly considering implementing a recommendation engine
 that is built off of statistical models using Spark.  We attended the Spark
 Summit and were incredibly impressed with the technology and the entire
 community.  Since then, we have been exploring the technology and
 determining how we could use it for our specific needs.

  One algorithm that we ideally want to use as part of our project is
 Gradient Boosted Machines.  We are aware that they have not yet been
 implemented in MLlib and would like to submit our request that they be
 considered for future implementation.  Additionally, we would love to see
 the AdaBoost algorithm implemented in MLlib and feature preprocessing
 implemented in Python (as it already exists for Scala).

  Otherwise, thank you for taking our feedback and for providing us with
 this incredible technology.

  Daniel



Re: MLLib : Decision Tree with minimum points per node

2014-06-19 Thread Manish Amde
Hi Justin,

I am glad to know that trees are working well for you.

The trees will support minimum samples per node in a future release. Thanks
for the feedback.

-Manish


On Fri, Jun 13, 2014 at 8:55 PM, Justin Yip yipjus...@gmail.com wrote:

 Hello,

 I have been playing around with mllib's decision tree library. It is
 working great, thanks.

 I have a question regarding overfitting. It appears to me that the current
 implementation doesn't allow the user to specify the minimum number of
 samples per node. This results in some nodes containing only very few
 samples, which potentially leads to overfitting.

 I would like to know if there is a workaround or any way to prevent
 overfitting. Or will decision trees support min-samples-per-node in future
 releases?

 Thanks.

 Justin





Re: MLLib : Decision Tree with minimum points per node

2014-06-19 Thread Manish Amde
Hi Justin,

I have created a JIRA ticket to keep track of your request. Thanks.
https://issues.apache.org/jira/browse/SPARK-2207

-Manish


On Thu, Jun 19, 2014 at 2:35 PM, Manish Amde manish...@gmail.com wrote:

 Hi Justin,

 I am glad to know that trees are working well for you.

 The trees will support minimum samples per node in a future release.
 Thanks for the feedback.

 -Manish


 On Fri, Jun 13, 2014 at 8:55 PM, Justin Yip yipjus...@gmail.com wrote:

 Hello,

 I have been playing around with mllib's decision tree library. It is
 working great, thanks.

 I have a question regarding overfitting. It appears to me that the
 current implementation doesn't allow the user to specify the minimum number
 of samples per node. This results in some nodes containing only very few
 samples, which potentially leads to overfitting.

 I would like to know if there is a workaround or any way to prevent
 overfitting. Or will decision trees support min-samples-per-node in future
 releases?

 Thanks.

 Justin






Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-15 Thread Manish Amde
Hi Suraj,

I don't see any logs from MLlib. You might need to explicitly set the logging
level to DEBUG for MLlib. Adding this line to log4j.properties might fix the
problem:
log4j.logger.org.apache.spark.mllib.tree=DEBUG

Also, please let me know if you encounter similar problems with the
Spark master.

-Manish


On Sat, Jun 14, 2014 at 3:19 AM, SURAJ SHETH shet...@gmail.com wrote:

 Hi Manish,
 Thanks for your reply.

 I am attaching the logs here (regression, 5 levels). It contains the last
 few hundred lines. Also, I am attaching a screenshot of the Spark UI. The
 first 4 levels complete in less than 6 seconds, while the 5th level doesn't
 complete even after several hours.
 Due to the reason that this is somebody else's data, I can't share it.

 Can you check the code snippet attached in my first email and see if it
 needs something to enable it to work for large data and >= 5 levels? It is
 working for 3 levels on the same dataset, but not for 5 levels.

 In the mean time, I will try to run it on the latest master and let you
 know the results. If it runs fine there, then, it can be related to 128 MB
 limit issue that you mentioned.

 Thanks and Regards,
 Suraj Sheth



 On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde manish...@gmail.com wrote:

 Hi Suraj,

 I can't answer 1) without knowing the data. However, the results for 2)
 are surprising indeed. We have tested with a billion samples for regression
 tasks so I am perplexed with the behavior.

 Could you try the latest Spark master to see whether this problem goes
 away. It has code that limits memory consumption at the master and worker
 nodes to 128 MB by default which ideally should not be needed given the
 amount of RAM on your cluster.

 Also, feel free to send the DEBUG logs. It might give me a better idea of
 where the algorithm is getting stuck.

 -Manish



 On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH shet...@gmail.com wrote:

 Hi Filipus,
 The train data is already oversampled.
 The number of positives I mentioned above is for the test dataset:
 12,028 (apologies for not making this clear earlier).
 The train dataset has 61,264 positives out of 689,763 total rows. The
 number of negatives is 628,499.
 Oversampling was done for the train dataset to ensure that we have
 at least 9-10% positives in the train part.
 No oversampling was done for the test dataset.

 So, the only difference that remains is the amount of data used for
 building a tree.

 But I have a few more questions:
 Have we tried how much data can be used at most to build a single
 decision tree?
 Since I have enough RAM to fit all the data into memory (only 1.3 GB of
 train data and 30x3 GB of RAM), I would expect it to build a single
 decision tree with all the data without any issues. But for maxDepth = 5,
 it is not able to. I confirmed that when it keeps running for hours, the
 amount of free memory available is more than 70%, so it doesn't seem to be
 a memory issue either.


 Thanks and Regards,
 Suraj Sheth


 On Wed, Jun 11, 2014 at 10:19 PM, filipus floe...@gmail.com wrote:

 Well, I guess your problem is quite unbalanced, and due to the information
 value as a splitting criterion I guess the algo stops after very few
 splits.

 A workaround is oversampling.

 Build many training datasets like this:

 randomly take 50% of the positives, and from the negatives the same amount,
 or let's say double that

 => 6000 positives and 12000 negatives

 Build a tree.

 This you do many times => many models (agents),

 and then you make an ensemble model, meaning all the models vote.

 This is in a way similar to random forest, but built completely differently.


 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Random Forest on Spark

2014-04-18 Thread Manish Amde
Sorry for arriving late to the party! Evan has clearly explained the
current implementation, our future plans, and the key differences from the
PLANET paper. I don't think I can add more to his comments. :-)

I apologize for not creating the corresponding JIRA tickets for the tree
improvements (multiclass classification, deep trees, post-shuffle
single-machine computation for small datasets, code refactoring for
pluggable loss calculation) and tree ensembles (RF, GBT, AdaBoost,
ExtraTrees, partial implementation of RF). I will create them soon.

We are currently working on creating very fast ensemble trees, which will be
different from the current ensemble tree implementations in other libraries.
PRs for tree improvements would be great -- just make sure you go carefully
through the tree code, which I think is fairly well documented but
non-trivial to understand, and discuss your changes on JIRA before
implementation to avoid duplication.

-Manish


On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks evan.spa...@gmail.comwrote:

 Interesting, and thanks for the thoughts.

 I think we're on the same page with 100s of millions of records. We've
 tested the tree implementation in mllib on 1b rows and up to 100 features -
 though this isn't hitting the 1000s of features you mention.

 Obviously multi class support isn't there yet, but I can see your point
 about deeper trees for many class problems. Will try them out on some image
 processing stuff with 1k classes we're doing in the lab once they are more
 developed to get a sense for where the issues are.

 If you're only allocating 2GB/worker you're going to have a hard time
 getting the real advantages of Spark.

 For your 1k features causing heap exceptions at depth 5: are these
 categorical or continuous? The categorical vars create much smaller
 histograms.

 If you're fitting all continuous features, the memory requirements are
 O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
 level of the tree. Even accounting for object overhead, with the default
 number of bins, the histograms at this depth should be order of 10s of MB,
 not 2GB - so I'm guessing your cached data is occupying a significant chunk
 of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
 data points with 20 continuous features and was able to run without running
 into memory issues (and scaling properties got better as the depth grew).
 His worker mem was 7.5GB and 30% of that was reserved for caching. If you
 wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
 necessary for heap space for the worker to compute/store the histograms,
 and I guess 2x that on the master to do the reduce.
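 Evan's O(b*d*2^l) estimate can be checked with quick arithmetic; the 8 bytes per histogram entry is an assumption, and object overhead is ignored:

```python
# Back-of-the-envelope version of the O(b * d * 2^l) histogram estimate above:
# b bins per feature, d continuous features, 2^l nodes at tree level l, and an
# assumed 8 bytes per histogram entry (object overhead ignored).
def histogram_bytes(bins, features, level, bytes_per_entry=8):
    return bins * features * (2 ** level) * bytes_per_entry

# e.g. 32 bins, 1000 features, depth 5: megabytes, not gigabytes
mb = histogram_bytes(32, 1000, 5) / (1024.0 * 1024.0)
print(round(mb, 1))  # 7.8
```

 Even with generous per-entry overhead this stays in the tens of megabytes at depth 5, consistent with the point that the 2 GB heap is likely being consumed by cached data rather than histograms.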

 Again 2GB per worker is pretty tight, because there are overheads of just
 starting the jvm, launching a worker, loading libraries, etc.

 - Evan

 On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung coded...@cs.stanford.edu
 wrote:

 Yes, it should be data specific and perhaps we're biased toward the data
 sets that we are playing with. To put things in perspective, we're highly
 interested in (and I believe, our customers are):

 1. large (hundreds of millions of rows)
 2. multi-class classification - nowadays, dozens of target categories are
 common and even thousands in some cases - you could imagine that this is a
 big reason for us requiring more 'complex' models
 3. high dimensional with thousands of descriptive and sort-of-independent
 features

 From the theoretical perspective, I would argue that it's usually in the
 best interest to prune as little as possible. I believe that pruning
 inherently increases bias of an individual tree, which RF can't do anything
 about while decreasing variance - which is what RF is for.

 The default pruning criterion for R's reference implementation is a min-node
 size of 1 (meaning a fully grown tree) for classification, and 5 for
 regression. I'd imagine they did at least some empirical testing to justify
 these values at the time - although at a time of small datasets :).

 FYI, we are also considering the MLLib decision tree for our Gradient
 Boosting implementation, however, the memory requirement is still a bit too
 steep (we were getting heap exceptions at depth limit of 5 with 2GB per
 worker with approximately 1000 features). Now 2GB per worker is about what
 we expect our typical customers would tolerate and I don't think that it's
 unreasonable for shallow trees.



 On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks evan.spa...@gmail.comwrote:

 What kind of data are you training on? These effects are *highly* data
 dependent, and while saying "a depth of 10 is simply not adequate to
 build high-accuracy models" may be accurate for the particular problem
 you're modeling, it is not true in general. From a statistical perspective,
 I consider each node in each tree an additional degree of freedom for the
 model, and all else equal I'd expect a model with fewer degrees of freedom
 to generalize better. Regardless, if