[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2017-01-09 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley @srowen Could anyone check my latest commit, please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2017-01-05 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley Actually, I see quite a few ways to improve MLlib GBDT.
First, we should stop using the RandomForest API and make a private API
for GBDT, because the RandomForest API performs a lot of redundant or
duplicated operations, such as buildMetadata, findSplitsBins, and
convertToTreeRDD. For GBDT, these operations actually only need to run
once; they do not need to be repeated for every tree.

I'd like to change that, but I need some advice to avoid breaking the
existing architecture.
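
A toy sketch of the idea above, in plain Python rather than Spark code:
setup work that does not change between boosting iterations is hoisted out
of the per-tree loop. `CountingSetup`, `fit_repeating_setup`, and
`fit_shared_setup` are hypothetical names invented for this illustration.
Whether buildMetadata, findSplitsBins, and convertToTreeRDD can all be
shared this way in MLlib is the commenter's claim; the sketch does not
verify it.

```python
class CountingSetup:
    """Stand-in for per-dataset setup work; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def prepare(self, data):
        # Placeholder for expensive, tree-independent preprocessing.
        self.calls += 1
        return sorted(data)


def fit_repeating_setup(data, num_trees, setup):
    """Runs the setup once per tree, as a shared per-tree API would."""
    trees = []
    for t in range(num_trees):
        prepared = setup.prepare(data)
        trees.append((t, len(prepared)))  # dummy "tree"
    return trees


def fit_shared_setup(data, num_trees, setup):
    """Runs the setup once and reuses the result for every tree."""
    prepared = setup.prepare(data)
    return [(t, len(prepared)) for t in range(num_trees)]


repeating, shared = CountingSetup(), CountingSetup()
fit_repeating_setup([3, 1, 2], 5, repeating)
fit_shared_setup([3, 1, 2], 5, shared)
print(repeating.calls, shared.calls)
```

Both functions build the same dummy trees; only the number of setup calls
differs (once per tree versus once in total).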



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2017-01-05 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley I've changed the code to specify the storageLevel via a Param.
Please check.



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2017-01-03 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley OK, I will close the issue. Do I need to close the PR?



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2017-01-03 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16415
  
@zdh2292390 Thanks for the update.  Given that this will change behavior 
for existing workloads, I'll ask that we specify it via a Param.  Also, I'm 
going to create a new JIRA for this since it has changed significantly since 
the original JIRA reporting the failure.  Could you please close this issue and 
instead work from this new JIRA?  
https://issues.apache.org/jira/browse/SPARK-19063
Thank you!

I'll add the other TODOs to your original JIRA.
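
A minimal sketch of what "specify it via a Param" could look like, written
as a plain-Python toy rather than Spark's actual Param API. The class
`ToyBoostingParams` and its method names are hypothetical. The default
stays at MEMORY_ONLY (the level this thread describes as the current
behavior), so existing workloads are unaffected unless a user opts in. The
set of names is a subset of Spark's StorageLevel constants.

```python
# A subset of Spark's StorageLevel names, used here only for validation.
VALID_STORAGE_LEVELS = frozenset({
    "MEMORY_ONLY",
    "MEMORY_ONLY_SER",
    "MEMORY_AND_DISK",
    "MEMORY_AND_DISK_SER",
    "DISK_ONLY",
})


class ToyBoostingParams:
    """Hypothetical parameter holder; not Spark's Param machinery."""

    DEFAULT_STORAGE_LEVEL = "MEMORY_ONLY"  # keeps the existing behavior

    def __init__(self):
        self._storage_level = self.DEFAULT_STORAGE_LEVEL

    def set_storage_level(self, name):
        # Reject unknown names early instead of failing inside a job.
        if name not in VALID_STORAGE_LEVELS:
            raise ValueError(f"unknown storage level: {name!r}")
        self._storage_level = name
        return self  # allow chained setter calls

    def get_storage_level(self):
        return self._storage_level


params = ToyBoostingParams()
print(params.get_storage_level())
print(params.set_storage_level("MEMORY_AND_DISK").get_storage_level())
```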



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-29 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley Yes, changing the storageLevel in predErrorCheckpointer fixes
the problem.

Could you please merge my latest commit, which only changes the
storageLevel in predErrorCheckpointer?

When you begin to fix the remaining problems, can I join in?



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-29 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16415
  
Thanks for checking!

Does changing the storageLevel in predErrorCheckpointer fix the problem?

"other use cases": Well, I remember thinking about this a lot when adding 
the periodic checkpointer, and it had to do with the fact that RDDs may be 
materialized later than checkpointer.update() gets called.  Now that I look 
again, it's possible that we could maintain 2 instead of 3 cached RDDs in the 
checkpointer's persistedQueue, but I'd want to check this more carefully.

Problem with 2 RDDs left cached after the loop: This could be fixed by 
adding a finalize() method to trait LDAOptimizer which can clean up the extra 
cached RDD.  Unfortunately, now that the trait is public, we cannot change it.  
This fix will need to wait until we move the implementation to the spark.ml 
package, at which time we can fix the API.
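
The leftover-cache problem and the proposed cleanup hook can be sketched
with a toy model in plain Python. This is not Spark's checkpointer:
`ToyCheckpointer`, `unpersist_all`, and `run_loop` are hypothetical names,
and the number of entries left behind in the real implementation may
differ from this model, which simply keeps up to `max_persisted` entries
in its queue.

```python
from collections import deque


class ToyCheckpointer:
    """Toy bounded queue of 'cached' dataset names."""

    def __init__(self, max_persisted=3):
        self.max_persisted = max_persisted
        self.persisted_queue = deque()
        self.cached = set()

    def update(self, name):
        # "Persist" the new dataset, then evict the oldest ones.
        self.cached.add(name)
        self.persisted_queue.append(name)
        while len(self.persisted_queue) > self.max_persisted:
            self.cached.discard(self.persisted_queue.popleft())

    def unpersist_all(self):
        # Hypothetical cleanup hook: release everything still queued.
        while self.persisted_queue:
            self.cached.discard(self.persisted_queue.popleft())


def run_loop(checkpointer, num_iterations, cleanup):
    """Runs the update loop; returns how many entries remain cached."""
    try:
        for i in range(num_iterations):
            checkpointer.update(f"predError-{i}")
    finally:
        if cleanup:
            checkpointer.unpersist_all()
    return len(checkpointer.cached)


print(run_loop(ToyCheckpointer(), 5, cleanup=False))
print(run_loop(ToyCheckpointer(), 5, cleanup=True))
```

Without the cleanup call, whatever is still in the queue when the loop
ends stays cached; with it, nothing remains.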



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-28 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley There is another problem: predErrorCheckpointer does not
unpersist the RDDs in the queue after the loop, so 2 RDDs are left cached
after the loop.



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-28 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley I tested with MEMORY_ONLY and got the same problem:
`(ExecutorLostFailure (executor 6 exited caused by one of the running 
tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 
10 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.)`
and it took more than 1 hour.

So my code should just change the storageLevel in predErrorCheckpointer?




[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-28 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
@jkbradley Could I ask what the other use cases are in "The periodic
checkpointer caches more because of some other use cases"?



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-28 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
Thanks, that is very clear. I found that I did something stupid: my test
code actually didn't include predErrorCheckpointer (I had disabled it), so
my committed code is not right; it should delete the persist in
predErrorCheckpointer.
I am now running my test with MEMORY_ONLY instead of MEMORY_AND_DISK, and
we will know what really caused the improvement in 1 hour.



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-28 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16415
  
predErrorCheckpointer should already be persisting and unpersisting 
predError.  This PR's changes will mean:
* persist will use MEMORY_AND_DISK instead of MEMORY_ONLY
* 1 (instead of 3) predError RDDs will be persisted at any one time.  (The 
periodic checkpointer caches more because of some other use cases.)

I recommend we figure out which of the 2 changes above are causing the 
improvement for your use case.  Could you please try your test with MEMORY_ONLY 
instead of MEMORY_AND_DISK?
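
To make the second change concrete, here is a toy model in plain Python
(not Spark code) of a loop that caches one new dataset per iteration and
evicts the oldest once a limit is exceeded. `peak_cached` and
`max_persisted` are hypothetical names for this illustration; the count is
taken after eviction in each iteration.

```python
from collections import deque


def peak_cached(num_iterations, max_persisted):
    """Returns the largest number of datasets held cached at once."""
    persisted = deque()
    peak = 0
    for i in range(num_iterations):
        persisted.append(f"predError-{i}")  # "persist" the newest dataset
        while len(persisted) > max_persisted:
            persisted.popleft()  # "unpersist" the oldest dataset
        peak = max(peak, len(persisted))
    return peak


print(peak_cached(10, max_persisted=3))
print(peak_cached(10, max_persisted=1))
```

If each cached dataset is roughly the same size, the cached footprint in
this model scales with `max_persisted`, which is why the two changes are
worth testing separately.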



[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-27 Thread zdh2292390
Github user zdh2292390 commented on the issue:

https://github.com/apache/spark/pull/16415
  
Can one of the admins verify this patch?




[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...

2016-12-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16415
  
Can one of the admins verify this patch?

