[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ]

Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:59 AM:

[~debasish83] I think that making VF-LBFGS/VF-OWLQN work against a generic distributed vector interface (moved into breeze?), so that they support multiple distributed platforms (not only Spark), would make it difficult to optimize them for Spark. When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations require combining Spark features closely with the optimizer algorithm. An abstract interface for distributed vectors (for example, vector space operators such as dot, add and scale, plus persist/unpersist operators and so on) does not seem to be enough.

Here are two simple problems that show the complexity of a general interface:

1. Look at this VF-OWLQN implementation based on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala
OWLQN internally computes the pseudo-gradient for L1 regularization. In the function `calculateComponentWithL1`, while computing the pseudo-gradient over an RDD, the code also uses an accumulator (which only Spark has) to calculate the adjusted fnValue. Would the abstract interface then have to contain something like Spark's `accumulator`?

2. The persist, unpersist and checkpoint problem in Spark. Because Spark computes lazily, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints taking no effect and so on). To see this complexity, take a look at the VF-LBFGS implementation on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala
It uses the pattern "persist current step RDDs, then unpersist previous step RDDs", like many other algorithms in Spark MLlib.
The difficulty is that Spark always computes lazily: when you call persist on an RDD, nothing is persisted immediately; the persistence is postponed until an RDD action is called. If the internal code calls `unpersist` too early, an RDD that has not been computed yet, and so has not actually been persisted (although the persist API was called), is already unpersisted, and this causes the whole RDD lineage to be recomputed. This behavior may be very different from other distributed platforms, so can a general interface really handle this problem correctly and still stay efficient?

Given the detailed problems listed above (I only listed a small part of them), in my opinion breeze can provide the following base classes and/or abstract interfaces:
* FirstOrderMinimizer
* DiffFunction interface
* LineSearch implementation (including StrongWolfeLineSearch and BacktrackingLineSearch)
* DistributedVector abstract interface

*BUT* the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in Spark MLlib, for better optimization.
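The persist/unpersist ordering problem described in this comment can be illustrated with a toy model. The sketch below is plain Python, not Spark: `LazyDataset`, `run_step` and `compute_count` are hypothetical names invented for this illustration, and the class only imitates two RDD behaviors (persist merely marks a dataset, and caching happens when an action runs). It is a sketch under those assumptions and is not the code in VectorFreeLBFGS.scala.

```python
class LazyDataset:
    """Toy stand-in for an RDD: computes lazily, caches only when an action runs."""

    def __init__(self, compute, parent=None):
        self._compute = compute          # function(parent_values) -> list of values
        self.parent = parent
        self._persist_requested = False
        self._cache = None
        self.compute_count = 0           # how many times this node was really computed

    def persist(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._persist_requested = True
        return self

    def unpersist(self):
        # Drops both the mark and any cached values.
        self._persist_requested = False
        self._cache = None
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        parent_values = self.parent._materialize() if self.parent else None
        values = self._compute(parent_values)
        self.compute_count += 1
        if self._persist_requested:
            self._cache = values
        return values

    def collect(self):
        # An "action": the point where work (and caching) actually happens.
        return list(self._materialize())


def run_step(unpersist_before_action):
    """Run one 'iteration' and return how often the previous step was computed."""
    prev = LazyDataset(lambda _: [1.0, 2.0, 3.0]).persist()
    prev.collect()                       # prev is now really cached
    cur = LazyDataset(lambda xs: [2 * x for x in xs], parent=prev).persist()
    if unpersist_before_action:
        prev.unpersist()                 # too early: cur has not been computed yet
        cur.collect()                    # forces prev to be recomputed
    else:
        cur.collect()                    # materialize cur first
        prev.unpersist()                 # now prev can be dropped safely
    return prev.compute_count
```

In this model, calling `run_step(True)` computes the previous step one more time than `run_step(False)` does, which is the kind of recomputation the comment warns about. A generic interface would need some way to express "materialize first, then release" for each backend.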
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820180#comment-15820180 ]

Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:54 AM:

Given the detailed problems listed above (I only listed a small part of them), in my opinion breeze can provide the following base classes and/or abstract interfaces:
* FirstOrderMinimizer
* DiffFunction interface
* LineSearch implementation (including StrongWolfeLineSearch and BacktrackingLineSearch)
* DistributedVector abstract interface

BUT the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be implemented in Spark MLlib, for better optimization.

> Vector-free L-BFGS
> ------------------
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
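To make the "DistributedVector abstract interface" item concrete, here is a minimal sketch of what such an interface could look like, using the operators named in the discussion (dot, add, scale, persist/unpersist). It is written in Python for brevity and every name in it (`DistributedVector`, `ChunkedVector`, the method signatures) is an assumption made for illustration; it is not an existing breeze or Spark API. `ChunkedVector` keeps its "partitions" as in-memory lists so the example runs on its own.

```python
from abc import ABC, abstractmethod


class DistributedVector(ABC):
    """Hypothetical abstract interface; not part of breeze or Spark."""

    @abstractmethod
    def dot(self, other):
        """Return the inner product with another vector as a local float."""

    @abstractmethod
    def add(self, other):
        """Return a new vector equal to self + other."""

    @abstractmethod
    def scale(self, alpha):
        """Return a new vector equal to alpha * self."""

    def persist(self):
        # Backends with lazy evaluation would override this.
        return self

    def unpersist(self):
        return self


class ChunkedVector(DistributedVector):
    """In-memory stand-in: each inner list plays the role of one partition."""

    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]

    def dot(self, other):
        # Per-chunk partial sums, then a final reduction.
        partials = (
            sum(a * b for a, b in zip(mine, theirs))
            for mine, theirs in zip(self.chunks, other.chunks)
        )
        return sum(partials)

    def add(self, other):
        return ChunkedVector(
            [a + b for a, b in zip(mine, theirs)]
            for mine, theirs in zip(self.chunks, other.chunks)
        )

    def scale(self, alpha):
        return ChunkedVector([alpha * a for a in chunk] for chunk in self.chunks)
```

Note that this interface says nothing about when work is actually executed, which is exactly the gap the comments in this thread point at for a lazy backend such as Spark.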
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819851#comment-15819851 ]

Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:43 AM:

[~debasish83] Can L-BFGS-B be computed in a distributed way, with high efficiency, when scaled to billions of features? If only the interface supports distributed vectors while the internal computation still uses local vectors and/or local matrices, it does not seem to make much sense. Currently VF-LBFGS can turn the L-BFGS two-loop recursion into a distributed computation, but L-BFGS-B seems much more complex than L-BFGS; can it also be computed in parallel?

I looked into the L-BFGS-B code in breeze. The core part, which updates the Hessian approximation and computes the descent direction, is very complex, and it cannot reuse the LBFGS code. So in what way can L-BFGS-B take advantage of `Vector-free LBFGS`?
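For readers unfamiliar with how the two-loop recursion becomes "vector-free", the following sketch shows the idea as I understand it from the paper linked in the issue description: the search direction is kept as coefficients over the 2m+1 basis vectors (the s vectors, the y vectors and the gradient), so the recursion itself only touches a small matrix of dot products. This is a plain-Python illustration on local lists, not the spark-vlbfgs code; all function names are made up for the example, and in a distributed setting only `dot_matrix` and `combine` would touch the large vectors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_loop_direct(s_list, y_list, g):
    """Standard L-BFGS two-loop recursion on full vectors; returns -H*g."""
    m = len(s_list)
    q = [-x for x in g]
    alphas = [0.0] * m
    for i in reversed(range(m)):
        alphas[i] = dot(s_list[i], q) / dot(s_list[i], y_list[i])
        q = [qj - alphas[i] * yj for qj, yj in zip(q, y_list[i])]
    gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    r = [gamma * qj for qj in q]
    for i in range(m):
        beta = dot(y_list[i], r) / dot(s_list[i], y_list[i])
        r = [rj + (alphas[i] - beta) * sj for rj, sj in zip(r, s_list[i])]
    return r


def dot_matrix(basis):
    """All pairwise dot products of the basis [s_0..s_{m-1}, y_0..y_{m-1}, g]."""
    return [[dot(bi, bj) for bj in basis] for bi in basis]


def two_loop_vector_free(B, m):
    """Same recursion, but only on coefficients over the basis, using B."""
    n = 2 * m + 1
    delta = [0.0] * n
    delta[2 * m] = -1.0                  # start from -g
    alphas = [0.0] * m
    for i in reversed(range(m)):
        alphas[i] = sum(delta[l] * B[l][i] for l in range(n)) / B[i][m + i]
        delta[m + i] -= alphas[i]        # q <- q - alpha_i * y_i
    gamma = B[m - 1][2 * m - 1] / B[2 * m - 1][2 * m - 1]
    delta = [gamma * d for d in delta]
    for i in range(m):
        beta = sum(delta[l] * B[m + i][l] for l in range(n)) / B[i][m + i]
        delta[i] += alphas[i] - beta     # r <- r + (alpha_i - beta) * s_i
    return delta


def combine(delta, basis):
    """Direction = sum_l delta[l] * basis[l]."""
    dim = len(basis[0])
    return [sum(d * b[j] for d, b in zip(delta, basis)) for j in range(dim)]
```

Both routines assume at least one (s, y) pair with positive curvature (s·y > 0). The question raised in the comment is whether the L-BFGS-B update admits a comparable reformulation; this sketch does not answer that.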
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ]

Weichen Xu edited comment on SPARK-10078 at 1/12/17 2:48 AM:

[~debasish83] But when we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations require combining Spark features closely with the optimizer algorithm. An abstract interface for distributed vectors (for example, vector space operators such as dot, add and scale, plus persist/unpersist operators and so on) does not seem to be enough.

Here are two simple problems that show the complexity of a general interface:

1. Look at this VF-OWLQN implementation based on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala
OWLQN internally computes the pseudo-gradient for L1 regularization. In the function `calculateComponentWithL1`, while computing the pseudo-gradient over an RDD, the code also uses an accumulator (which only Spark has) to calculate the adjusted fnValue. Would the abstract interface then have to contain something like Spark's `accumulator`?

2. The persist, unpersist and checkpoint problem in Spark. Because Spark computes lazily, an improper persist/unpersist/checkpoint order can cause serious problems (RDD recomputation, checkpoints taking no effect and so on). To see this complexity, take a look at the VF-LBFGS implementation on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala
It uses the pattern "persist current step RDDs, then unpersist previous step RDDs", like many other algorithms in Spark MLlib.
The difficulty is that Spark always computes lazily: when you call persist on an RDD, nothing is persisted immediately; the persistence is postponed until an RDD action is called. If the internal code calls `unpersist` too early, an RDD that has not been computed yet, and so has not actually been persisted (although the persist API was called), is already unpersisted, and this causes the whole RDD lineage to be recomputed. This behavior may be very different from other distributed platforms, so can a general interface really handle this problem correctly and still stay efficient?

[~sethah] Did you consider these detailed problems when designing the general optimizer interface?
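Point 1 above can be made concrete with a small sketch of the OWL-QN pseudo-gradient for an L1 term, computed partition by partition while a side-channel accumulator collects the L1 penalty that is added to the function value. This is plain Python written for illustration, assuming the usual OWL-QN pseudo-gradient rule; `SumAccumulator`, `pseudo_gradient_component` and `pseudo_gradient_with_l1` are hypothetical names, and the code is not the `calculateComponentWithL1` function from spark-vlbfgs.

```python
class SumAccumulator:
    """Toy stand-in for a Spark accumulator: tasks add, the driver reads."""

    def __init__(self):
        self.value = 0.0

    def add(self, x):
        self.value += x


def pseudo_gradient_component(x, g, reg):
    """OWL-QN pseudo-gradient of f(x) + reg * |x| for one coordinate."""
    if x > 0:
        return g + reg
    if x < 0:
        return g - reg
    # x == 0: pick the one-sided derivative that allows descent, else 0.
    if g + reg < 0:
        return g + reg
    if g - reg > 0:
        return g - reg
    return 0.0


def pseudo_gradient_with_l1(x_parts, g_parts, reg, fn_value):
    """Return (pseudo-gradient partitions, fn_value + reg * ||x||_1).

    x_parts and g_parts are lists of partitions (lists of floats) holding the
    current point and the gradient of the smooth loss.
    """
    l1_acc = SumAccumulator()
    pg_parts = []
    for x_part, g_part in zip(x_parts, g_parts):   # one "task" per partition
        pg_part = []
        for x, g in zip(x_part, g_part):
            pg_part.append(pseudo_gradient_component(x, g, reg))
            l1_acc.add(reg * abs(x))               # side effect besides the result
        pg_parts.append(pg_part)
    return pg_parts, fn_value + l1_acc.value
```

The point of the example is the shape of the computation: one pass over the data yields both a new distributed vector and a scalar, and a plain dot/add/scale interface has no obvious slot for the scalar side channel.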
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819964#comment-15819964 ]

Weichen Xu edited comment on SPARK-10078 at 1/12/17 2:45 AM:

[~debasish83] When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations require combining Spark features closely with the optimizer algorithm. An abstract interface for distributed vectors (for example, vector space operators such as dot, add, and scale, plus persist/unpersist operators and so on) does not seem to be enough. Here are two simple problems that show the complexity of designing a general interface:

1. Look at this VF-OWLQN implementation based on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala
OWLQN internally computes the pseudo-gradient for L1 regularization. See the function `calculateComponentWithL1`: while computing the pseudo-gradient over an RDD, it also uses an accumulator (which only Spark has) to calculate the adjusted fnValue. So would the abstract interface have to contain something like Spark's `accumulator`?

2. The persist, unpersist, and checkpoint problem in Spark. Because Spark evaluates lazily, an improper persist/unpersist/checkpoint order may cause serious problems (RDD recomputation, checkpoints that take no effect, and so on). To see this complexity, take a look at the VF-LBFGS implementation on Spark: it uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. The complexity is that Spark always computes lazily: when you persist an RDD, it is not persisted immediately; persisting is postponed until an action is called on the RDD. If the internal code calls `unpersist` too early, an RDD that has not been computed, and so has not actually been persisted (although the persist API was called), is already unpersisted, and this causes the whole RDD lineage to be recomputed. This behavior may be very different from other distributed platforms, so can a general interface really handle this problem correctly and still stay efficient?

[~sethah] Did you consider these detailed problems when designing the general optimizer interface?

was (Author: weichenxu123):
[~debasish83] When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many optimizations require combining Spark features closely with the optimizer algorithm. An abstract interface for distributed vectors (for example, vector space operators such as dot, add, and scale, plus persist/unpersist operators and so on) does not seem to be enough. Here are two simple problems that show the complexity of designing a general interface:

1. Look at this VF-OWLQN implementation based on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala
OWLQN internally computes the pseudo-gradient for L1 regularization. See the function `calculateComponentWithL1`: while computing the pseudo-gradient over an RDD, it also uses an accumulator (which only Spark has) to calculate the adjusted fnValue. So would the abstract interface have to contain something like Spark's `accumulator`?

2. The persist, unpersist, and checkpoint problem in Spark. Because Spark evaluates lazily, an improper persist/unpersist/checkpoint order may cause serious problems (RDD recomputation, checkpoints that take no effect, and so on). To see this complexity, take a look at the VF-LBFGS implementation on Spark: it uses the pattern "persist the current step's RDDs, then unpersist the previous step's RDDs", like many other algorithms in Spark MLlib. The complexity is that Spark always computes lazily: when you persist an RDD, it is not persisted immediately; persisting is postponed until an action is called on the RDD. If the internal code calls `unpersist` too early, an RDD that has not been computed and has not been persisted is already unpersisted. This behavior may be very different from other distributed platforms, so can a general interface really handle this problem correctly and still stay efficient?

[~sethah] Did you consider these detailed problems when designing the general optimizer interface?

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
>
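The persist/unpersist ordering problem raised in the comment above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: `LazyDataset`, `run_single_step`, and `run_two_steps` are hypothetical names invented for the illustration, and the model assumes only the behavior the comment describes (persist marks a dataset, the cache is filled when an action runs, and unpersist drops both the mark and the cache).

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset (NOT Spark's RDD)."""

    def __init__(self, compute, parent=None):
        self._compute = compute          # maps the parent's values (or None) to a list
        self._parent = parent
        self._persist_requested = False
        self._cache = None
        self.compute_count = 0           # how many times this step actually ran

    def map(self, fn):
        # Lazy transformation: only records the lineage, runs nothing.
        return LazyDataset(lambda values: [fn(v) for v in values], parent=self)

    def persist(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._persist_requested = True
        return self

    def unpersist(self):
        # Drops the mark and any cached values.
        self._persist_requested = False
        self._cache = None
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        parent_values = None
        if self._parent is not None:
            parent_values = self._parent._materialize()
        self.compute_count += 1
        values = self._compute(parent_values)
        if self._persist_requested:
            self._cache = values         # the cache is filled only during an action
        return values

    def count(self):
        # An "action": forces evaluation of the whole lineage.
        return len(self._materialize())


def run_single_step(unpersist_early):
    """Persist a step, optionally unpersist it before any action, then use it twice."""
    source = LazyDataset(lambda _: [1, 2, 3])
    step = source.map(lambda v: v * v).persist()
    if unpersist_early:
        step.unpersist()                 # persist never took effect: no action ran yet
    step.count()
    step.count()
    return step.compute_count


def run_two_steps(unpersist_prev_first):
    """The 'persist current step, then unpersist previous step' pattern."""
    source = LazyDataset(lambda _: [1, 2, 3])
    prev = source.map(lambda v: v * 2).persist()
    prev.count()                         # previous step is computed and cached
    curr = prev.map(lambda v: v + 1).persist()
    if unpersist_prev_first:
        prev.unpersist()                 # too early: curr has not been computed yet
        curr.count()                     # has to recompute prev and source
    else:
        curr.count()                     # curr is computed from the cached prev
        prev.unpersist()                 # safe: curr is already cached
    curr.count()
    return source.compute_count, prev.compute_count, curr.compute_count


if __name__ == "__main__":
    print(run_single_step(unpersist_early=False), run_single_step(unpersist_early=True))
    print(run_two_steps(unpersist_prev_first=False), run_two_steps(unpersist_prev_first=True))
```

In this toy model, the early-unpersist variants run the affected steps more than once, while the variants that call an action before unpersisting run each step exactly once. A real platform differs in many details (partitions, eviction, checkpointing), so the sketch only shows why the order of persist, action, and unpersist matters.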
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804899#comment-15804899 ]

Yanbo Liang edited comment on SPARK-10078 at 1/6/17 4:27 PM:

[~debasish83] We aim to implement VL-BFGS as an optimizer for problems with ~billion features, as a peer of Breeze LBFGS/OWLQN, so that ML algorithms can switch between them automatically based on the number of features. So an abstract interface between the algorithms and optimizers is absolutely necessary. For VL-BFGS, I have a basic implementation at https://github.com/yanboliang/spark-vlbfgs; please feel free to review and comment on the code. Thanks.

was (Author: yanboliang):
[~debasish83] We aim to implement VL-BFGS as an optimizer similar to Breeze LBFGS/OWLQN, and switching between them should happen automatically based on the number of features. So an abstract interface between the algorithm and the optimizer is really necessary. I have a basic implementation at https://github.com/yanboliang/spark-vlbfgs; please feel free to review and comment on the code. Thanks.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793760#comment-15793760 ]

Debasish Das edited comment on SPARK-10078 at 1/3/17 12:26 AM:

Ideally, feature partitioning should be tuned automatically. At 100M features, the master-only processing that we do with Breeze LBFGS/OWLQN will also benefit from VL-BFGS. Ideally it should be part of Breeze, and a proper interface should be defined so that the Breeze VL-BFGS solver can be called from Spark ML. There are bounded BFGS solvers in Breeze as well; all of them will benefit from this change. A solver could then be used in other frameworks too, and need not be constrained to RDDs if possible.

was (Author: debasish83):
Ideally, feature partitioning should be tuned automatically. At 100M features, the master-only processing that we do with Breeze LBFGS/OWLQN will also benefit from VL-BFGS. Ideally it should be part of Breeze, and a proper interface should be defined so that the Breeze VL-BFGS solver can be called from Spark ML.
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978922#comment-14978922 ]

Kotaro Tanahashi edited comment on SPARK-10078 at 10/28/15 6:27 PM:

When vector-free L-BFGS applies 2D partitioning to the training data, is it necessary to create a Gradient class (such as LogisticGradient or HingeGradient) for distributed feature data?

was (Author: tana):
When vector-free L-BFGS applies 2D partitioning to the training data, is it necessary to create a Gradient class (such as LogisticGradient or HingeGradient) for 2D distributed data?

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> This is to implement a scalable version of vector-free L-BFGS
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978922#comment-14978922 ]

Kotaro Tanahashi edited comment on SPARK-10078 at 10/28/15 6:10 PM:

When vector-free L-BFGS applies 2D partitioning to the training data, is it necessary to create a 2D distributed version of the Gradient class (such as LogisticGradient or HingeGradient)?

was (Author: tana):
When vector-free L-BFGS applies 2D partitioning to the training data, is it necessary to create a 2D distributed version of the Gradient class, such as LogisticGradient or HingeGradient?
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978922#comment-14978922 ]

Kotaro Tanahashi edited comment on SPARK-10078 at 10/28/15 6:12 PM:

When vector-free L-BFGS applies 2D partitioning to the training data, is it necessary to create a Gradient class (such as LogisticGradient or HingeGradient) for 2D distributed data?

was (Author: tana):
When vector-free L-BFGS applies 2D partitioning to the training data, is it necessary to create a 2D distributed version of the Gradient class (such as LogisticGradient or HingeGradient)?