[jira] [Commented] (SYSTEMML-2083) Language and runtime for parameter servers
[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368702#comment-16368702 ] LI Guobao commented on SYSTEMML-2083: - Hello, I'm LI Guobao, a second-year master's student in computer science at South China University of Technology. I also hold a master's degree in computer science from Polytech Nantes in France. I'm currently a contributor to the open-source project Alien4Cloud and have some experience with distributed computing frameworks. I'm extremely interested in contributing to this project. Thank you in advance. > Language and runtime for parameter servers > -- > > Key: SYSTEMML-2083 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2083 > Project: SystemML > Issue Type: Epic >Reporter: Matthias Boehm >Priority: Major > Labels: gsoc2018 > Attachments: image-2018-02-14-12-18-48-932.png, > image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png > > > SystemML already provides a rich set of execution strategies ranging from > local operations to large-scale computation on MapReduce or Spark. In this > context, we support both data-parallel (multi-threaded or distributed > operations) as well as task-parallel computation (multi-threaded or > distributed parfor loops). This epic aims to complement the existing > execution strategies by language and runtime primitives for parameter > servers, i.e., model-parallel execution. We use the terminology of > model-parallel execution with distributed data and distributed model to > differentiate them from the existing data-parallel operations. Target > applications are distributed deep learning and mini-batch algorithms in > general. These new abstractions will help make SystemML a unified framework > for small- and large-scale machine learning that supports all three major > execution strategies in a single framework. > > A major challenge is the integration of stateful parameter servers and their > common push/pull primitives into an otherwise functional (and thus > stateless) language. We will approach this challenge via a new builtin > function {{paramserv}} which internally maintains state but at the same time > fits into the runtime framework of stateless operations. > Furthermore, we are interested in providing (1) different runtime backends > (local and distributed), (2) different parameter server modes (synchronous, > asynchronous, hogwild!, stale-synchronous), (3) different update frequencies > (batch, multi-batch, epoch), as well as (4) different architectures for > distributed data (1 parameter server, k workers) and distributed model (k1 > parameter servers, k2 workers). > > *Note for GSOC students:* This is a large project which will be broken down > into sub-projects, so everybody will have their share of the pie. > *Prerequisites:* Java; machine learning experience is a plus but not required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2083) Language and runtime for parameter servers
[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370151#comment-16370151 ] LI Guobao commented on SYSTEMML-2083: - Hi, I have run some samples of SystemML locally on my PC and looked through the papers above about SystemML as well as parameter servers. I now have an overview of the three strategies for updating the parameters. I'm motivated to learn more about the runtime of built-in functions and the off-the-shelf solutions for parameter servers, so I'd like to know if I can join the discussion group for this subject. Thank you in advance for the response, Guobao > Language and runtime for parameter servers > -- > > Key: SYSTEMML-2083 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2083 > Project: SystemML > Issue Type: Epic >Reporter: Matthias Boehm >Priority: Major > Labels: gsoc2018 > Attachments: image-2018-02-14-12-18-48-932.png, > image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png > > > SystemML already provides a rich set of execution strategies ranging from > local operations to large-scale computation on MapReduce or Spark. In this > context, we support both data-parallel (multi-threaded or distributed > operations) as well as task-parallel computation (multi-threaded or > distributed parfor loops). This epic aims to complement the existing > execution strategies by language and runtime primitives for parameter > servers, i.e., model-parallel execution. We use the terminology of > model-parallel execution with distributed data and distributed model to > differentiate them from the existing data-parallel operations. Target > applications are distributed deep learning and mini-batch algorithms in > general. These new abstractions will help make SystemML a unified framework > for small- and large-scale machine learning that supports all three major > execution strategies in a single framework. > > A major challenge is the integration of stateful parameter servers and their > common push/pull primitives into an otherwise functional (and thus > stateless) language. We will approach this challenge via a new builtin > function {{paramserv}} which internally maintains state but at the same time > fits into the runtime framework of stateless operations. > Furthermore, we are interested in providing (1) different runtime backends > (local and distributed), (2) different parameter server modes (synchronous, > asynchronous, hogwild!, stale-synchronous), (3) different update frequencies > (batch, multi-batch, epoch), as well as (4) different architectures for > distributed data (1 parameter server, k workers) and distributed model (k1 > parameter servers, k2 workers). > > *Note for GSOC students:* This is a large project which will be broken down > into sub-projects, so everybody will have their share of the pie. > *Prerequisites:* Java; machine learning experience is a plus but not required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2077) New eval builtin function
[ https://issues.apache.org/jira/browse/SYSTEMML-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372920#comment-16372920 ] LI Guobao commented on SYSTEMML-2077: - Hello, can you give me some information about how the "eval" function works in a real case? I'd like to take this ticket. Regards, Guobao > New eval builtin function > - > > Key: SYSTEMML-2077 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2077 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > This new eval builtin function aims to provide a concise language construct > to evaluate dynamic expressions and functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2077) New eval builtin function
[ https://issues.apache.org/jira/browse/SYSTEMML-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374310#comment-16374310 ] LI Guobao commented on SYSTEMML-2077: - Hi [~mboehm7], I have a problem compiling the code. I wanted to launch a test in the class MLContextTest, but I got compilation errors because of missing classes such as PydmlParser, DmlParser, etc. I saw that they are excluded in the .gitignore file. So how can I compile the code to launch a test? Regards, Guobao > New eval builtin function > - > > Key: SYSTEMML-2077 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2077 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > This new eval builtin function aims to provide a concise language construct > to evaluate dynamic expressions and functions. > Similar to R's eval function > (https://stat.ethz.ch/R-manual/R-devel/library/base/html/eval.html), this > would allow us to evaluate dynamically constructed expressions. There are two > major sub-tasks here: the invocation of given function pointers and the > evaluation of dynamic expressions given as strings. Initially, we would focus > on the former by allowing calls such as {{R = eval(fname, A, B, C)}}. So far > SystemML does not provide second-order functions, which requires explicit > {{if-else}} conditions for ensemble learning workloads. With this new > {{eval}} function we could store a list of function names in a frame {{F}} > and dynamically call them via {{R[i, ] = eval(F[i,1], A, B, C)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
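To illustrate the second-order dispatch behind {{eval}}, here is a minimal Java sketch under assumed names (a concept illustration only, not SystemML's actual DML implementation): functions are registered in a map keyed by name and looked up at runtime, which is exactly what explicit {{if-else}} chains would otherwise have to hard-code per function.
{code:java}
import java.util.Arrays;
import java.util.Map;
import java.util.function.BinaryOperator;

// Concept sketch (hypothetical names): dispatch a function by its name at
// runtime, as R = eval(fname, A, B, C) does for DML functions.
public class EvalSketch {
    // registry of named functions (stand-ins for DML ensemble members)
    static final Map<String, BinaryOperator<double[]>> FUNCTIONS = Map.of(
        "add", (a, b) -> { double[] r = a.clone(); for (int i = 0; i < r.length; i++) r[i] += b[i]; return r; },
        "mul", (a, b) -> { double[] r = a.clone(); for (int i = 0; i < r.length; i++) r[i] *= b[i]; return r; });

    static double[] eval(String fname, double[] a, double[] b) {
        BinaryOperator<double[]> f = FUNCTIONS.get(fname);
        if (f == null)
            throw new IllegalArgumentException("unknown function: " + fname);
        return f.apply(a, b); // second-order call: the function is a value
    }

    public static void main(String[] args) {
        // iterate over function names, as in R[i, ] = eval(F[i,1], A, B, C)
        for (String fname : new String[]{"add", "mul"})
            System.out.println(fname + ": " + Arrays.toString(
                eval(fname, new double[]{1, 2}, new double[]{3, 4})));
    }
}
{code}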
[jira] [Commented] (SYSTEMML-2078) Support for global constants
[ https://issues.apache.org/jira/browse/SYSTEMML-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391518#comment-16391518 ] LI Guobao commented on SYSTEMML-2078: - I have an idea about this. Could we add a built-in function to assign a value to a global constant variable, like "global(size, 10)"? > Support for global constants > > > Key: SYSTEMML-2078 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2078 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > Similar to R, where variables of the surrounding context are accessible, this > task aims to introduce global constant variables. Furthermore, we should also > add builtin constants such as NaN, INF, and PI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SYSTEMML-2078) Support for global constants
[ https://issues.apache.org/jira/browse/SYSTEMML-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391518#comment-16391518 ] LI Guobao edited comment on SYSTEMML-2078 at 3/8/18 9:50 PM: - [~mboehm7] I have an idea about this. Could we add a built-in function to assign a value to a global constant variable, like "global(size, 10)"? was (Author: guobao): I have an idea about this. Could we add a built-in function to assign a value to a global constant variable, like "global(size, 10)"? > Support for global constants > > > Key: SYSTEMML-2078 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2078 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > Similar to R, where variables of the surrounding context are accessible, this > task aims to introduce global constant variables. Furthermore, we should also > add builtin constants such as NaN, INF, and PI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2078) Support for global constants
[ https://issues.apache.org/jira/browse/SYSTEMML-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394613#comment-16394613 ] LI Guobao commented on SYSTEMML-2078: - OK, thanks [~mboehm7]. I will try to work on this issue following your hints. > Support for global constants > > > Key: SYSTEMML-2078 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2078 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > Similar to R, where variables of the surrounding context are accessible, this > task aims to introduce global constant variables. Furthermore, we should also > add builtin constants such as NaN, INF, and PI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418050#comment-16418050 ] LI Guobao commented on SYSTEMML-2197: - Hi [~mboehm7], could you give me some more details on this issue? Thanks > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419113#comment-16419113 ] LI Guobao commented on SYSTEMML-2197: - Thanks [~mboehm7] for the details. Which test should I run for it? Thanks. > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/data/PartitionedBlock.java#L82) > as well as the subsequent partition creation and actual broadcasting > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L548). > > For consistency and in order to avoid excessive over-provisioning, this > multi-threading should use the common internal thread pool or parallel Java > streams, which similarly call the shared {{ForkJoinPool.commonPool}}. An > example is the multi-threaded parallelization of RDDs, which similarly blocks > a given matrix into its squared blocks (see > https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L679). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2197: --- Assignee: LI Guobao > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/data/PartitionedBlock.java#L82) > as well as the subsequent partition creation and actual broadcasting > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L548). > > For consistency and in order to avoid excessive over-provisioning, this > multi-threading should use the common internal thread pool or parallel Java > streams, which similarly call the shared {{ForkJoinPool.commonPool}}. An > example is the multi-threaded parallelization of RDDs, which similarly blocks > a given matrix into its squared blocks (see > https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L679). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
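As a rough illustration of the requested change, the following Java fragment is a minimal sketch (illustrative names and data layout, not SystemML's actual {{PartitionedBlock}} code) of parallelizing the blocking loop with parallel Java streams, which execute on the shared {{ForkJoinPool.commonPool}} as required above.
{code:java}
import java.util.stream.IntStream;

// Sketch: blockify a dense row-major matrix into blen x blen blocks in
// parallel; writes go to disjoint slots of the output array, so no locking.
public class ParallelBlockify {
    public static double[][][] blockify(double[] A, int rows, int cols, int blen) {
        int nrb = (rows + blen - 1) / blen; // number of row blocks
        int ncb = (cols + blen - 1) / blen; // number of column blocks
        double[][][] blocks = new double[nrb * ncb][][];
        // one independent task per block, scheduled on ForkJoinPool.commonPool
        IntStream.range(0, nrb * ncb).parallel().forEach(ix -> {
            int bi = ix / ncb, bj = ix % ncb;
            int rlen = Math.min(blen, rows - bi * blen); // partial last blocks
            int clen = Math.min(blen, cols - bj * blen);
            double[][] blk = new double[rlen][clen];
            for (int i = 0; i < rlen; i++)
                for (int j = 0; j < clen; j++)
                    blk[i][j] = A[(bi * blen + i) * cols + (bj * blen + j)];
            blocks[ix] = blk;
        });
        return blocks;
    }
}
{code}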
[jira] [Assigned] (SYSTEMML-2077) New eval builtin function
[ https://issues.apache.org/jira/browse/SYSTEMML-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2077: --- Assignee: LI Guobao > New eval builtin function > - > > Key: SYSTEMML-2077 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2077 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This new eval builtin function aims to provide a concise language construct > to evaluate dynamic expressions and functions. > Similar to R's eval function > (https://stat.ethz.ch/R-manual/R-devel/library/base/html/eval.html), this > would allow us to evaluate dynamically constructed expressions. There are two > major sub-tasks here: the invocation of given function pointers and the > evaluation of dynamic expressions given as strings. Initially, we would focus > on the former by allowing calls such as {{R = eval(fname, A, B, C)}}. So far > SystemML does not provide second-order functions, which requires explicit > {{if-else}} conditions for ensemble learning workloads. With this new > {{eval}} function we could store a list of function names in a frame {{F}} > and dynamically call them via {{R[i, ] = eval(F[i,1], A, B, C)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2232) Logical namespace handling user-defined functions
[ https://issues.apache.org/jira/browse/SYSTEMML-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426205#comment-16426205 ] LI Guobao commented on SYSTEMML-2232: - Hi [~mboehm7], as I understand it, we will keep the logical name instead of converting it to the real filename? > Logical namespace handling user-defined functions > - > > Key: SYSTEMML-2232 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2232 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Priority: Major > > At script level, functions might have logical namespace names such as > {{foo::bar()}}, where foo is the namespace name, and bar is the function > name. To handle namespace conflicts, SYSTEMML-631 internally replaced the > logical namespaces with filenames. For reasons such as improved statistics > output and the handling of namespace functions in the recently introduced > {{eval}} function (SYSTEMML-2077), it would be good to keep the logical > namespace as well. > This task aims to (1) extend the {{FunctionStatementBlock}} and > {{FunctionProgramBlock}} data structures to keep the logical namespace name, > (2) extend the parser and compiler accordingly, and (3) modify the statistics > maintenance to use the function key (i.e., concatenation of logical namespace > and function name) as the opcode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426745#comment-16426745 ] LI Guobao commented on SYSTEMML-2197: - OK. I got an error when launching this test. [~mboehm7], could you help me out with this? I have set the classpath to the systemml module, and the generated folders inside target can also be found. {code:java} 18/04/05 12:40:42 INFO api.DMLScript: END DML run 04/05/2018 12:40:42 starting R script cmd: Rscript --default-packages=methods,datasets,graphics,grDevices,stats,utils ./src/test/scripts/functions/binary/matrix_full_other/FullDistributedMatrixMultiplication.R target/testTemp/functions/binary/matrix_full_other/FullDistributedMatrixMultiplicationTest/in/ target/testTemp/functions/binary/matrix_full_other/FullDistributedMatrixMultiplicationTest/expected/0.7_0.1/ java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at java.lang.Runtime.exec(Runtime.java:620) at java.lang.Runtime.exec(Runtime.java:450) at java.lang.Runtime.exec(Runtime.java:347) at org.apache.sysml.test.integration.AutomatedTestBase.runRScript(AutomatedTestBase.java:990) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.runDistributedMatrixMatrixMultiplicationTest(FullDistributedMatrixMultiplicationTest.java:277) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.testDenseSparseRmmSpark(FullDistributedMatrixMultiplicationTest.java:209) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.junit.runner.JUnitCore.run(JUnitCore.java:160) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47) at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) Caused by: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:247) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 32 more {code} > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https:
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428709#comment-16428709 ] LI Guobao commented on SYSTEMML-2197: - Thanks, I successfully launched this test but got 4 failed tests (testDenseDenseMapmmMR, testDenseSparseMapmmMR, testSparseDenseMapmmMR, testSparseSparseMapmmMR), which seem to concern the MR backend. When switching back to the master branch, I got the same result. So is this a known bug? Can I ignore it? Thanks for the response. Here is the stack trace: {code:java} 18/04/06 19:38:48 ERROR api.DMLScript: Failed to execute DML script. org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 23 and 27 -- Error evaluating instruction: jobtype = GMR input labels = [_mVar13, _mVar14] recReader inst = rand inst = mapper inst = MR°mapmm°0·MATRIX·DOUBLE°1·MATRIX·DOUBLE°2·MATRIX·DOUBLE°RIGHT°false shuffle inst = agg inst = MR°ak+°2·MATRIX·DOUBLE°3·MATRIX·DOUBLE°true°NONE other inst = output labels = [pVar15] result indices = ,3 num reducers = 10 replication = 1 at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123) at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:97) at org.apache.sysml.api.DMLScript.execute(DMLScript.java:744) at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:515) at org.apache.sysml.api.DMLScript.main(DMLScript.java:246) at org.apache.sysml.test.integration.AutomatedTestBase.runTest(AutomatedTestBase.java:1214) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.runDistributedMatrixMatrixMultiplicationTest(FullDistributedMatrixMultiplicationTest.java:276) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.testSparseSparseMapmmMR(FullDistributedMatrixMultiplicationTest.java:101){code} > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/data/PartitionedBlock.java#L82) > as well as the subsequent partition creation and actual broadcasting > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L548). > > For consistency and in order to avoid excessive over-provisioning, this > multi-threading should use the common internal thread pool or parallel Java > streams, which similarly call the shared {{ForkJoinPool.commonPool}}. 
An > example is the multi-threaded parallelization of RDDs, which similarly blocks > a given matrix into its squared blocks (see > https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L679). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431179#comment-16431179 ] LI Guobao commented on SYSTEMML-1313: - Hi [~mboehm7], could you give me more details about this issue? > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-1313: --- Assignee: LI Guobao > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > The parfor optimizer may decide to execute the entire loop as a remote Spark > job to utilize cluster parallelism. In this case all inputs to the parfor > body (i.e., variables that are created or read outside of the parfor body but > used or overwritten inside) are read from HDFS. In the past there was an > issue of redundant reads, which has been addressed with SYSTEMML-1879. > However, the direct use of Spark broadcast variables would likely improve > performance, especially in clusters with many nodes. > This task aims to leverage Spark broadcast variables for all parfor inputs. > In detail, this entails two major aspects. First, we need runtime support to > optionally broadcast the inputs via broadcast variables in > {{RemoteParForSpark}} and obtain them from these broadcast variables in > {{RemoteParForSparkWorker}} without causing unnecessary eviction. In > contrast to the existing broadcast primitives, we don't need to blockify the > matrix because the matrix is accessed in full by in-memory operations. > Second, this requires an extension of the parfor optimizer to reason about > scenarios where it is safe to use broadcasts, because these broadcasts cause > additional memory requirements since they act as pinned in-memory matrices. > This second task likely overlaps with SYSTEMML-1349, which requires a > similar reasoning to handle shared reads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
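A minimal Java sketch of the first aspect follows, under assumed names (the actual {{RemoteParForSpark}}/{{RemoteParForSparkWorker}} wiring and SystemML's matrix types are omitted): the driver broadcasts each input once, and remote tasks obtain it from the broadcast variable instead of re-reading it from HDFS per task.
{code:java}
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Sketch only (hypothetical types): ship a parfor input to all workers
// as a Spark broadcast variable.
public class ParForBroadcastSketch {
    public static long run(JavaSparkContext sc, double[][] input, List<Integer> taskIds) {
        // broadcast the full, non-blockified matrix once per job;
        // parfor bodies access it as a whole via in-memory operations
        Broadcast<double[][]> bInput = sc.broadcast(input);
        JavaRDD<Integer> tasks = sc.parallelize(taskIds);
        // each task pulls the matrix from the broadcast, not from HDFS
        return tasks.filter(id -> processTask(id, bInput.value())).count();
    }

    // placeholder for the actual parfor body executed per iteration
    private static boolean processTask(int id, double[][] X) {
        return X.length > 0;
    }
}
{code}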
[jira] [Commented] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445646#comment-16445646 ] LI Guobao commented on SYSTEMML-1313: - Hi [~mboehm7], I have introduced the runtime support and I'd like to run a test for it. Which test should I take? Should I install a YARN cluster so that the parfor can be executed in `remote_spark` mode? > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > The parfor optimizer may decide to execute the entire loop as a remote Spark > job to utilize cluster parallelism. In this case all inputs to the parfor > body (i.e., variables that are created or read outside of the parfor body but > used or overwritten inside) are read from HDFS. In the past there was an > issue of redundant reads, which has been addressed with SYSTEMML-1879. > However, the direct use of Spark broadcast variables would likely improve > performance, especially in clusters with many nodes. > This task aims to leverage Spark broadcast variables for all parfor inputs. > In detail, this entails two major aspects. First, we need runtime support to > optionally broadcast the inputs via broadcast variables in > {{RemoteParForSpark}} and obtain them from these broadcast variables in > {{RemoteParForSparkWorker}} without causing unnecessary eviction. In > contrast to the existing broadcast primitives, we don't need to blockify the > matrix because the matrix is accessed in full by in-memory operations. > Second, this requires an extension of the parfor optimizer to reason about > scenarios where it is safe to use broadcasts, because these broadcasts cause > additional memory requirements since they act as pinned in-memory matrices. > This second task likely overlaps with SYSTEMML-1349, which requires a > similar reasoning to handle shared reads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-1313: Comment: was deleted (was: Hi [~mboehm7], I have introduced the runtime support and I'd like to run a test for it. Which test should I take? Should I install a YARN cluster so that the parfor can be executed in `remote_spark` mode?) > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > The parfor optimizer may decide to execute the entire loop as a remote Spark > job to utilize cluster parallelism. In this case all inputs to the parfor > body (i.e., variables that are created or read outside of the parfor body but > used or overwritten inside) are read from HDFS. In the past there was an > issue of redundant reads, which has been addressed with SYSTEMML-1879. > However, the direct use of Spark broadcast variables would likely improve > performance, especially in clusters with many nodes. > This task aims to leverage Spark broadcast variables for all parfor inputs. > In detail, this entails two major aspects. First, we need runtime support to > optionally broadcast the inputs via broadcast variables in > {{RemoteParForSpark}} and obtain them from these broadcast variables in > {{RemoteParForSparkWorker}} without causing unnecessary eviction. In > contrast to the existing broadcast primitives, we don't need to blockify the > matrix because the matrix is accessed in full by in-memory operations. > Second, this requires an extension of the parfor optimizer to reason about > scenarios where it is safe to use broadcasts, because these broadcasts cause > additional memory requirements since they act as pinned in-memory matrices. > This second task likely overlaps with SYSTEMML-1349, which requires a > similar reasoning to handle shared reads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2298) Create a test dml script based on NN library
LI Guobao created SYSTEMML-2298: --- Summary: Create a test dml script based on NN library Key: SYSTEMML-2298 URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2298) Create a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2298: --- Assignee: LI Guobao > Create a test dml script based on NN library > > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Summary: Creation of a test dml script based on NN library (was: Create a test dml script based on NN library) > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared, and a test dml script that leverages the new "paramserve" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared. > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be > fully prepared, and a test dml script that leverages the new "paramserve" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared, and a test dml script that leverages the new "paramserv" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared. (was: During the community bonding period, the development environment should be fully prepared, and a test dml script that leverages the new "paramserve" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared.) > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be > fully prepared, and a test dml script that leverages the new "paramserv" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2299) API design of the paramserv function
LI Guobao created SYSTEMML-2299: --- Summary: API design of the paramserv function Key: SYSTEMML-2299 URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Due Date: 28/May/18 > Language and compiler extension > --- > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Due Date: 14/May/18 > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be > fully prepared, and a test dml script that leverages the new "paramserv" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Summary: Implementation of language and compiler extension (was: Language and compiler extension) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2084: --- Assignee: LI Guobao > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2299: --- Assignee: LI Guobao > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Due Date: 4/Jun/18 Summary: Single-node parameter server primitives (was: Basic runtime primitives) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2085: --- Assignee: LI Guobao > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2300) First evaluation
LI Guobao created SYSTEMML-2300: --- Summary: First evaluation Key: SYSTEMML-2300 URL: https://issues.apache.org/jira/browse/SYSTEMML-2300 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2086: --- Assignee: LI Guobao > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Summary: Initial version of local backend (was: Local, multi-threaded backend) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Due Date: 25/Jun/18 > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2087: --- Assignee: LI Guobao Due Date: 9/Jul/18 Summary: Initial version of distributed spark backend (was: Distributed spark backend) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2301) Second evaluation
LI Guobao created SYSTEMML-2301: --- Summary: Second evaluation Key: SYSTEMML-2301 URL: https://issues.apache.org/jira/browse/SYSTEMML-2301 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2302) Second version of execution backend
LI Guobao created SYSTEMML-2302: --- Summary: Second version of execution backend Key: SYSTEMML-2302 URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2303) Integration test, implementation of samples and documentation
LI Guobao created SYSTEMML-2303: --- Summary: Integration test, implementation of samples and documentation Key: SYSTEMML-2303 URL: https://issues.apache.org/jira/browse/SYSTEMML-2303 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2304) Submit final product
LI Guobao created SYSTEMML-2304: --- Summary: Submit final product Key: SYSTEMML-2304 URL: https://issues.apache.org/jira/browse/SYSTEMML-2304 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2083) Language and runtime for parameter servers
[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2083: --- Assignee: LI Guobao > Language and runtime for parameter servers > -- > > Key: SYSTEMML-2083 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2083 > Project: SystemML > Issue Type: Epic >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Labels: gsoc2018 > Attachments: image-2018-02-14-12-18-48-932.png, > image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png > > > SystemML already provides a rich set of execution strategies ranging from > local operations to large-scale computation on MapReduce or Spark. In this > context, we support both data-parallel (multi-threaded or distributed > operations) as well as task-parallel computation (multi-threaded or > distributed parfor loops). This epic aims to complement the existing > execution strategies by language and runtime primitives for parameter > servers, i.e., model-parallel execution. We use the terminology of > model-parallel execution with distributed data and distributed model to > differentiate them from the existing data-parallel operations. Target > applications are distributed deep learning and mini-batch algorithms in > general. These new abstractions will help make SystemML a unified framework > for small- and large-scale machine learning that supports all three major > execution strategies in a single framework. > > A major challenge is the integration of stateful parameter servers and their > common push/pull primitives into an otherwise functional (and thus > stateless) language. We will approach this challenge via a new builtin > function {{paramserv}} which internally maintains state but at the same time > fits into the runtime framework of stateless operations. > Furthermore, we are interested in providing (1) different runtime backends > (local and distributed), (2) different parameter server modes (synchronous, > asynchronous, hogwild!, stale-synchronous), (3) different update frequencies > (batch, multi-batch, epoch), as well as (4) different architectures for > distributed data (1 parameter server, k workers) and distributed model (k1 > parameter servers, k2 workers). > > *Note for GSOC students:* This is a large project which will be broken down > into sub-projects, so everybody will have their share of the pie. > *Prerequisites:* Java; machine learning experience is a plus but not required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature is > illustrated in Figure 1. We are interested in providing the model, the > training features and labels, the validation features and labels, the batch > update function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or batch), the > aggregation function, the number of epochs, the batch size, the degree of > parallelism, as well as the checkpointing strategy (e.g. rollback recovery). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery).) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature is > illustrated in Figure 1. We are interested in providing the model, the > training features and labels, the validation features and labels, the > gradient calculation function, the batch update function, the update strategy > (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. > epoch or batch), the aggregation function, the number of epochs, the batch > size, the degree of parallelism, as well as the checkpointing strategy (e.g. > rollback recovery). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery).) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, > freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, > checkpointing=rollback)_. We are interested in providing the model, the > training features and labels, the validation features and labels, the > gradient calculation function, the batch update function, the update strategy > (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. > epoch or batch), the aggregation function, the number of epochs, the batch > size, the degree of parallelism, as well as the checkpointing strategy (e.g. > rollback recovery). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Description: This part aims to add language support for the “paramserv” function in order to be able to compile this new function. Since SystemML already supports parameterized builtin functions, we can easily add an additional operation type and generate a new instruction for the “paramserv” function. Recently, we have also added a new “eval” built-in function which can take a function pointer as an argument so that it can be called at runtime. Similarly, we would need to extend the inter-procedural analysis to avoid removing unused constructed functions in the presence of the second-order “paramserv” function, because the referenced functions, i.e., the aggregate function and the update function, must be present at runtime. > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to add language support for the “paramserv” > function in order to be able to compile this new function. Since SystemML > already supports parameterized builtin functions, we can easily add an > additional operation type and generate a new instruction for the “paramserv” > function. Recently, we have also added a new “eval” built-in function which > can take a function pointer as an argument so that it can be called at > runtime. Similarly, we would need to extend the inter-procedural analysis > to avoid removing unused constructed functions in the presence of the > second-order “paramserv” function, because the referenced functions, i.e., > the aggregate function and the update function, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A parameter server persists the model parameters in a distributed manner. It is especially applied in the context of large-scale machine learning to train models. The parameter computation is performed with data parallelism across the workers. The data-parallel parameter server architecture is illustrated in Figure 2. Inspired by a lightweight parameter server interface [1], we provide the push and pull methods as internal primitives, i.e., not exposed at the script level, allowing the workers to exchange intermediates. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > A parameter server persists the model parameters in a distributed > manner. It is especially applied in the context of large-scale machine > learning to train models. The parameter computation is performed with > data parallelism across the workers. The data-parallel parameter server > architecture is illustrated in Figure 2. Inspired > by a lightweight parameter server interface [1], we provide > the push and pull methods as internal primitives, i.e., not exposed at the > script level, allowing the workers to exchange intermediates. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. (was: A parameter server persists the model parameters in a distributed manner. It is especially applied in the context of large-scale machine learning to train models. The parameter computation is performed with data parallelism across the workers. The data-parallel parameter server architecture is illustrated in Figure 2. Inspired by a lightweight parameter server interface [1], we provide the push and pull methods as internal primitives, i.e., not exposed at the script level, allowing the workers to exchange intermediates.) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), which stores each parameter as a value under a defined key. For example, inserting the global parameters under a key named “worker-param-replica” allows the workers to retrieve the parameter replica. Hence, in the local multi-threaded backend, the workers can communicate directly with this hashmap in the same process. In the Spark distributed backend, the CP first needs to fork a thread to start a parameter server which maintains the hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via TCP sockets. Since SystemML has good cache management, we only need to maintain in the hashmap a matrix reference pointing to a file location instead of the real data instance. If time permits, in order to introduce the async and staleness update strategies, we will implement the synchronization by leveraging vector clocks. (was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. The idea is to run a single-node parameter server by maintaining a > hashmap inside the CP (Control Program), which stores each parameter as a value > under a defined key. For example, inserting the global parameters > under a key named “worker-param-replica” allows the workers to retrieve the > parameter replica. Hence, in the local multi-threaded backend, > the workers can communicate directly with this hashmap in the same process. In > the Spark distributed backend, the CP first needs to fork a > thread to start a parameter server which maintains the hashmap; the workers > can then send intermediates and retrieve parameters by connecting to the > parameter server via TCP sockets. Since SystemML has good cache management, we > only need to maintain in the hashmap a matrix reference pointing to a file location > instead of the real data instance. If time permits, in order to > introduce the async and staleness update strategies, we will > implement the synchronization by leveraging vector clocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
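As a rough illustration of these internal primitives, the following minimal sketch (assumed names; not SystemML code) shows how a keyed store inside the CP could back push and pull. In SystemML the stored values would be matrix handles that, as noted above, may merely reference a cached file rather than hold the raw data.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal sketch (illustrative names, not SystemML code): the CP-internal
 *  key/value store behind the push/pull primitives. */
public class LocalParamServer {
    // values stand in for matrix handles, which may point to cached files
    private final Map<String, Object> store = new ConcurrentHashMap<>();

    // push: a worker publishes an intermediate (e.g. its gradients),
    // or the server publishes the new global parameters
    public void push(String key, Object value) {
        store.put(key, value);
    }

    // pull: retrieve the value stored under a key, e.g. the global
    // parameters under "worker-param-replica"
    public Object pull(String key) {
        return store.get(key);
    }
}
{code}
Blocking or consuming variants of {{pull}} (e.g. removing a gradient once aggregated) are deliberately omitted here; the point is only that a concurrent map suffices as the single-process backing store.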
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: ps.png > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. The idea is to run a single-node parameter server by maintaining a > hashmap inside the CP (Control Program), which stores each parameter as a value > under a defined key. For example, inserting the global parameters > under a key named “worker-param-replica” allows the workers to retrieve the > parameter replica. Hence, in the local multi-threaded backend, > the workers can communicate directly with this hashmap in the same process. In > the Spark distributed backend, the CP first needs to fork a > thread to start a parameter server which maintains the hashmap; the workers > can then send intermediates and retrieve parameters by connecting to the > parameter server via TCP sockets. Since SystemML has good cache management, we > only need to maintain in the hashmap a matrix reference pointing to a file location > instead of the real data instance. If time permits, in order to > introduce the async and staleness update strategies, we will > implement the synchronization by leveraging vector clocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). The function will return a trained model in format of struct. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery).) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), > the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch > size, the degree of parallelism, as well as the checkpointing strategy (e.g. > rollback recovery). The function will return a trained model in format of struct. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). The function will return a trained model in struct format. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). The function will return a trained model in format of struct.) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch size, the > degree of parallelism, as well as the checkpointing strategy (e.g. rollback > recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
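For reference, a hypothetical Java mirror of the configuration space named in this signature; SystemML exposes these as named dml arguments, not as a Java API, and all identifiers below are illustrative only.
{code:java}
/** Hypothetical mirror of the paramserv knobs, purely to enumerate them;
 *  defaults follow the example values in the signature above. */
public class ParamservConfig {
    enum Mode { SYNC, ASYNC, HOGWILD, STALE_SYNC }   // update strategy
    enum Frequency { BATCH, EPOCH }                  // update frequency
    enum Checkpointing { NONE, ROLLBACK }            // recovery strategy

    String updateFn;                 // upd=...  batch update function name
    String aggregateFn;              // agg=...  gradient aggregation function name
    Mode mode = Mode.SYNC;
    Frequency freq = Frequency.EPOCH;
    Checkpointing checkpointing = Checkpointing.ROLLBACK;
    int epochs = 100;
    int batchsize = 64;
    int k = 7;                       // degree of parallelism (worker count)
}
{code}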
[jira] [Updated] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Summary: Preparation of dev environment (was: Creation of a test dml script based on NN library) > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be fully > prepared. A test dml script which leverages the new "paramserv" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could also be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared. The native OpenBLAS library should be installed in order to run the MNIST LeNet example. Then, by leveraging the infinite MNIST data generator ([http://leon.bottou.org/projects/infimnist]), we can generate 256k instances to train the model. (was: During the community bonding period, the development environment should be fully prepared. A test dml script which leverages the new "paramserv" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could also be prepared.) > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be fully > prepared. The native OpenBLAS library should be installed in order to run the > MNIST LeNet example. Then, by leveraging the infinite MNIST data generator > ([http://leon.bottou.org/projects/infimnist]), we can generate 256k > instances to train the model. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2306) Implementation of a script with paramserv func
LI Guobao created SYSTEMML-2306: --- Summary: Implementation of a script with paramserv func Key: SYSTEMML-2306 URL: https://issues.apache.org/jira/browse/SYSTEMML-2306 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao This task aims to write a dml script using the paramserv function. We can easily reuse the MNIST LeNet example and adapt it by creating a struct-like model and passing the update function as well as the aggregation function. In this case, the update function, which will be executed in the workers, computes the gradients by running the forward and backward passes over a batch. The aggregation function, which will be run in the parameter server, updates the weights and biases by aggregating the received gradients. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
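As a toy stand-in for the two user functions such a script would pass, the sketch below shows a worker-side gradient function (forward plus backward pass over one batch) and a server-side aggregation function; a linear model replaces the LeNet layers purely for illustration, and all names are hypothetical.
{code:java}
import java.util.List;

/** Toy stand-in for the functions a paramserv script would pass
 *  (linear model instead of LeNet; illustrative only). */
public class ParamservFunctions {

    /** Worker side: gradient of squared loss for y ~ X * w on one batch. */
    static double[] gradient(double[] w, double[][] X, double[] y) {
        double[] g = new double[w.length];
        for (int i = 0; i < X.length; i++) {
            double pred = 0;                                   // forward pass
            for (int j = 0; j < w.length; j++) pred += X[i][j] * w[j];
            double err = pred - y[i];                          // backward pass
            for (int j = 0; j < w.length; j++)
                g[j] += 2 * err * X[i][j] / X.length;
        }
        return g;
    }

    /** Server side: average the received gradients and take one SGD step. */
    static double[] aggregate(double[] w, List<double[]> grads, double lr) {
        double[] out = w.clone();
        for (double[] g : grads)
            for (int j = 0; j < w.length; j++)
                out[j] -= lr * g[j] / grads.size();
        return out;
    }
}
{code}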
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Due Date: 17/May/18 (was: 21/May/18) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch size, the > degree of parallelism, as well as the checkpointing strategy (e.g. rollback > recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Due Date: 16/May/18 (was: 17/May/18) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch size, the > degree of parallelism, as well as the checkpointing strategy (e.g. rollback > recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2306) Implementation of a script with paramserv func
[ https://issues.apache.org/jira/browse/SYSTEMML-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2306: Due Date: 18/May/18 > Implementation of a script with paramserv func > -- > > Key: SYSTEMML-2306 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2306 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > This task aims to write a dml script using the paramserv function. We > can easily reuse the MNIST LeNet example and adapt it by creating a > struct-like model and passing the update function as well as the aggregation > function. In this case, the update function, which will be executed in the workers, > computes the gradients by running the forward and backward passes over a batch. > The aggregation function, which will be run in the > parameter server, updates the weights and biases by > aggregating the received gradients. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not. A diagram of the parameter server architecture is shown below. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not.
> Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > # In the case of the local multi-threaded parameter server, it is easy to > maintain a concurrent hashmap (which stores the parameters as values under > defined keys) inside the CP. The workers are launched in a multi-threaded > way to execute the gradient calculation function and push the gradients to > the hashmap. Another thread will be launched to pull the gradients from the > hashmap and call the aggregation function to update the parameters. > # In the case of the Spark distributed backend, we can launch a single remote > parameter server outside of the CP (as a worker) to provide the pull and push > services. For the moment, all the weights and biases are saved in this single > server. The exchange between server and workers will be implemented over > TCP. Hence, we can easily broadcast the IP address and the port number to > the workers, and the workers can then send the gradients and retrieve the new > parameters via TCP sockets. > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock consisting of > all workers' clocks in the server. Each time an iteration finishes, the > worker sends a request to the server, and the server sends back a response > indicating whether the worker should wait or not. > A diagram of the parameter server architecture is shown below.
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), which stores each parameter as a value under a defined key. For example, inserting the global parameters under a key named “worker-param-replica” allows the workers to retrieve the parameter replica. Hence, in the local multi-threaded backend, the workers can communicate directly with this hashmap in the same process. In the Spark distributed backend, the CP first needs to fork a thread to start a parameter server which maintains the hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via TCP sockets. Since SystemML has good cache management, we only need to maintain in the hashmap a matrix reference pointing to a file location instead of the real data instance. If time permits, in order to introduce the async and staleness update strategies, we will implement the synchronization by leveraging vector clocks. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > # In the case of the local multi-threaded parameter server, it is easy to > maintain a concurrent hashmap (which stores the parameters as values under > defined keys) inside the CP.
The workers are launched in a multi-threaded > way to execute the gradient calculation function and push the gradients to > the hashmap. Another thread will be launched to pull the gradients from the > hashmap and call the aggregation function to update the parameters. > # In the case of the Spark distributed backend, we can launch a single remote > parameter server outside of the CP (as a worker) to provide the pull and push > services. For the moment, all the weights and biases are saved in this single > server. The exchange between server and workers will be implemented over > TCP. Hence, we can easily broadcast the IP address and the port number to > the workers, and the workers can then send the gradients and retrieve the new > parameters via TCP sockets. > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock consisting of > all workers' clocks in the server. Each time an iteration finishes, the > worker sends a request to the server, and the server sends back a response > indicating whether the worker should wait or not.
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. It consists of > partitioning the data for the worker threads, launching the single-node parameter > server in the CP, shipping and calling the compiled statistical functions and > implementing the different update strategies. We will focus on > implementing the BSP execution strategy, i.e., the synchronous update strategy, > with both per-epoch and per-batch frequencies. Other update strategies (e.g. > asynchronous, stale-synchronous) and checkpointing strategies are > optional and will be added if time permits. The architecture for the synchronous > per-epoch update strategy is illustrated below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. The idea is to spawn a thread to launch the local parameter server, which is responsible for maintaining the parameter hashmap and executing the aggregation work. Then a number of workers will be forked according to the degree of parallelism. Each worker loads its data partition, performs the parameter updates per batch, pushes its gradients, and retrieves the new parameters from the server. The server retrieves the gradients of each worker using the related keys in a round-robin way, aggregates them, and pushes the new global parameters under the parameter-related keys. Finally, the paramserv function's main thread waits for the server's aggregator thread to join and obtains the final global parameters as the result. Hence, the pull/push primitives bring more flexibility and facilitate implementing other update strategies; a sketch of this loop follows this message. was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. It consists of partitioning the data for the > worker threads, launching the single-node parameter server in the CP, shipping > and calling the compiled statistical functions, and implementing the different > update strategies. We will focus on implementing the BSP execution strategy, > i.e., the synchronous update strategy, with both per-epoch and per-batch > frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) > and checkpointing strategies are optional and will be added if time permits. > The architecture for the synchronous per-epoch update strategy is illustrated > below. > The idea is to spawn a thread to launch the local parameter server, which is > responsible for maintaining the parameter hashmap and executing the > aggregation work.
Then a number of workers will be forked according to > the degree of parallelism. Each worker loads its data partition, performs the > parameter updates per batch, pushes its gradients, and retrieves the new > parameters from the server. The server retrieves the gradients of each worker > using the related keys in a round-robin way, aggregates them, and pushes the > new global parameters under the parameter-related keys. Finally, the > paramserv function's main thread waits for the server's aggregator thread > to join and obtains the final global parameters as the result. Hence, the > pull/push primitives bring more flexibility and facilitate implementing > other update strategies. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
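A minimal sketch of this per-epoch BSP loop, under the assumption of one single-slot keyed mailbox per worker (plain JDK concurrency, illustrative names; not the actual SystemML runtime):
{code:java}
import java.util.*;
import java.util.concurrent.*;

/** Sketch of the local BSP loop: k worker threads push per-epoch gradients
 *  into keyed mailboxes; an aggregator thread collects them round-robin,
 *  updates the global parameters, and publishes them back for the next epoch. */
public class LocalBspBackend {
    public static void main(String[] args) throws Exception {
        int k = 4, epochs = 3;
        List<SynchronousQueue<double[]>> grads = new ArrayList<>();
        List<SynchronousQueue<double[]>> params = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            grads.add(new SynchronousQueue<>());   // e.g. key "gradients_i"
            params.add(new SynchronousQueue<>());  // e.g. key "params_i"
        }
        ExecutorService pool = Executors.newFixedThreadPool(k + 1);
        for (int w = 0; w < k; w++) {
            final int id = w;
            pool.submit(() -> {                        // worker thread
                for (int e = 0; e < epochs; e++) {
                    double[] g = {1.0 + id};           // stand-in gradient
                    grads.get(id).put(g);              // push gradients
                    params.get(id).take();             // pull new params (blocks)
                }
                return null;
            });
        }
        Future<double[]> server = pool.submit(() -> {  // aggregator thread
            double[] global = {0.0};
            for (int e = 0; e < epochs; e++) {
                double sum = 0;
                for (int w = 0; w < k; w++)            // round-robin retrieval
                    sum += grads.get(w).take()[0];
                global[0] -= 0.1 * sum / k;            // aggregate and update
                for (int w = 0; w < k; w++)            // publish new globals
                    params.get(w).put(global.clone());
            }
            return global;
        });
        // main thread joins the aggregator and obtains the final parameters
        System.out.println(Arrays.toString(server.get()));
        pool.shutdown();
    }
}
{code}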
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. Push/Pull service: In general, we can launch a parameter server inside (local multi-threaded backend) or outside (Spark distributed backend) of the CP to provide the pull and push services. For the moment, all the weights and biases are saved in a hashmap under a single key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately under a given key. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. The server will also spawn a thread which retrieves the gradients by polling the hashmap with the relevant keys and aggregates them. Finally, it updates the global parameters in the hashmap. Synchronization: We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain, in the server, a vector clock recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives the gradients from a certain worker, it increments the vector clock for this worker. Thus we can define BSP as "staleness==0", ASP as "staleness==-1", and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not.
A diagram of the parameter server architecture is shown below. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the > server receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below.
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: ps.png > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., the > stale-synchronous strategy needs a hyperparameter "staleness" to define the > waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the server > receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
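The staleness logic described above might be sketched as follows. The semantics are taken directly from the BSP/ASP/SSP definitions in the description ("staleness==0" blocks until all workers are level, "staleness==-1" never blocks, "staleness==N" allows a bounded lead); the class and method names are illustrative, not SystemML code.
{code:java}
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Server-side vector clock with the assumed staleness semantics:
 *  staleness==0 -> BSP, staleness==-1 -> ASP, staleness==N -> SSP. */
public class VectorClock {
    private final AtomicIntegerArray clocks; // one logical clock per worker

    VectorClock(int numWorkers) {
        clocks = new AtomicIntegerArray(numWorkers);
    }

    /** Called when the server receives gradients from a worker. */
    void onGradients(int workerId) {
        clocks.incrementAndGet(workerId);
    }

    /** Called when a worker asks whether it may start its next iteration:
     *  it must not run more than 'staleness' clocks ahead of the slowest. */
    boolean shouldWait(int workerId, int staleness) {
        if (staleness < 0) return false;             // ASP: never wait
        int slowest = Integer.MAX_VALUE;
        for (int i = 0; i < clocks.length(); i++)
            slowest = Math.min(slowest, clocks.get(i));
        return clocks.get(workerId) - slowest > staleness; // BSP if staleness==0
    }
}
{code}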
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: (was: ps.png) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., the > stale-synchronous strategy needs a hyperparameter "staleness" to define the > waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the server > receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP. (was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. The idea is to spawn a thread to launch the local parameter server, which is responsible for maintaining the parameter hashmap and executing the aggregation work. Then a number of workers will be forked according to the degree of parallelism. Each worker loads its data partition, performs the parameter updates per batch, pushes its gradients, and retrieves the new parameters from the server. The server retrieves the gradients of each worker using the related keys in a round-robin way, aggregates them, and pushes the new global parameters under the parameter-related keys. Finally, the paramserv function's main thread waits for the server's aggregator thread to join and obtains the final global parameters as the result. Hence, the pull/push primitives bring more flexibility and facilitate implementing other update strategies.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. The idea is to spawn a thread in the CP for > running the parameter server. The workers are also launched in > a multi-threaded way in the CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP. (was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement a local execution backend for the compiled > “paramserv” function. The idea is to spawn a thread in the CP for running the > parameter server. The workers are also launched in a multi-threaded way in > the CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the BSP strategy for the Spark distributed backend. Hence, the idea is to be able to launch a remote parameter server and the workers. > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the Spark distributed backend. > Hence, the idea is to be able to launch a remote parameter server and the workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
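A bare-bones sketch of the worker/server TCP round trip described for this backend, using plain JDK sockets with Java serialization as a stand-in wire format; message framing, error handling, and the Spark-side launch of the server process are out of scope, and all names are illustrative rather than SystemML code.
{code:java}
import java.io.*;
import java.net.*;

/** Sketch of one push/pull round trip between a remote worker and the
 *  parameter server (illustrative only; not the real wire protocol). */
public class PsTcpExchange {

    /** Server side: accept one worker, read its gradients, apply a toy
     *  update step, and reply with the new global parameters. */
    static double[] serveOnce(ServerSocket server, double[] global)
            throws IOException, ClassNotFoundException {
        try (Socket s = server.accept()) {
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            out.flush(); // send the stream header so the peer can open its input
            ObjectInputStream in = new ObjectInputStream(s.getInputStream());
            double[] grads = (double[]) in.readObject();   // worker pushes
            for (int j = 0; j < global.length; j++)
                global[j] -= 0.1 * grads[j];               // toy aggregation
            out.writeObject(global);                       // worker pulls
            out.flush();
            return global;
        }
    }

    /** Worker side: host and port would have been broadcast beforehand. */
    static double[] pushPull(String host, int port, double[] grads)
            throws IOException, ClassNotFoundException {
        try (Socket s = new Socket(host, port)) {
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            out.flush();
            ObjectInputStream in = new ObjectInputStream(s.getInputStream());
            out.writeObject(grads);                        // push gradients
            out.flush();
            return (double[]) in.readObject();             // pull new params
        }
    }
}
{code}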
[jira] [Updated] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2302: Description: This part aims to complement the update strategies by adding ASP and SSP. > Second version of execution backend > --- > > Key: SYSTEMML-2302 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > This part aims to complement the update strategies by adding ASP and SSP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to implement the BSP strategy for the local execution backend. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP. (was: This part aims to implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the local execution backend. > The idea is to spawn a thread in the CP for running the parameter server. The > workers are also launched in a multi-threaded way in the CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Due Date: 25/May/18 (was: 28/May/18) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to add language support for the “paramserv” > function so that the new function can be compiled. Since SystemML > already supports parameterized built-in functions, we can easily add an > additional operation type and generate a new instruction for the “paramserv” > function. Recently, we have also added a new “eval” built-in function which > is capable of passing a function pointer as an argument so that it can be called at > runtime. Similarly, we need to extend the inter-procedural analysis > to avoid removing seemingly unused functions in the presence of the > second-order “paramserv” function, because the referenced functions, i.e., > the aggregation function and the update function, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Due Date: 1/Jun/18 (was: 4/Jun/18) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., the > stale-synchronous strategy needs a hyperparameter "staleness" to define the > waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the server > receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Due Date: 22/Jun/18 (was: 25/Jun/18) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the BSP strategy for the local execution backend. The idea is to spawn a thread in CP for running the parameter server, and the workers are also launched in a multi-threaded way in CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
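A rough sketch of this BSP control flow, reusing the hypothetical ParamServerSketch from the sketch above; the worker threads, barrier, and placeholder gradient computation are all illustrative assumptions, not the actual backend:
{code:java}
import java.util.concurrent.CyclicBarrier;

// Sketch of BSP in the local backend: k worker threads plus a server-side
// barrier action in CP, synchronized once per superstep. Illustrative only.
public class LocalBspSketch {
  public static void main(String[] args) {
    final int k = 4, supersteps = 10;
    final ParamServerSketch ps = new ParamServerSketch(new double[128]);
    // the barrier action runs once per superstep, after all workers pushed
    final CyclicBarrier barrier = new CyclicBarrier(k, () -> ps.aggregate(k));

    for (int w = 0; w < k; w++) {
      final int workerId = w;
      new Thread(() -> {
        try {
          for (int s = 0; s < supersteps; s++) {
            double[] params = ps.pull();               // fetch current model
            double[] grads = computeGradients(params); // local mini-batch step
            ps.push(workerId, grads);
            barrier.await(); // BSP: block until every worker finished the step
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }).start();
    }
  }

  static double[] computeGradients(double[] params) {
    return new double[params.length]; // placeholder gradient computation
  }
}
{code}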
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Due Date: 6/Jul/18 (was: 9/Jul/18) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the BSP strategy for the Spark distributed backend. The idea is to be able to launch a remote parameter server and the workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2302: Due Date: 27/Jul/18 (was: 6/Aug/18) > Second version of execution backend > --- > > Key: SYSTEMML-2302 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > This part aims to complement the update strategies by adding ASP and SSP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format.) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format.) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2298. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > Fix For: SystemML 1.2 > > > During the bonding time, the development environment should be fully prepared. The native library OpenBLAS should be installed in order to run the MNIST LeNet example. Then, by leveraging the infimnist MNIST data generator ([http://leon.bottou.org/projects/infimnist]), we could generate 256k instances to train the model. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model [: a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam : a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE, EPOCH, EPOCH10): the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. > *Inputs*: > * model [: a list consisting of the weight and bias matrices > * X : training features matrix > * y : training label matrix > * X_val : validation features matrix > * y_val : validation label matrix > * upd : the name of the gradient calculation function > * agg : the name of the gradient aggregation function > * mode (options: BSP, ASP, SSP): the update mode > * freq (options
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model [: a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam : a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE, EPOCH, EPOCH10): the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs Output: * model' : a list consisting of the updated weight and bias matrices was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=par
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs *Output*: * model' : a list consisting of the updated weight and bias matrices was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs Output: * model' : a list consisting of the updated weight and bias matrices > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPO
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous", hyperparam=params, checkpoint="NONE"){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs *Output*: * model' : a list consisting of the updated weight and bias matrices was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs *Output*: * model' : a list consisting of the updated weight and bias matrices > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", fre
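To clarify the scheme options listed above, the following sketch shows how row indices could be assigned to k workers under disjoint_contiguous and disjoint_round_robin; the helper names are hypothetical, not part of the proposed API (disjoint_random would shuffle the row indices before a contiguous split, and overlap_reshuffle would give each worker a reshuffled copy of the full data):
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of two of the data partition schemes; method names
// are assumptions, not part of the actual paramserv implementation.
public class PartitionSchemes {
  // disjoint_contiguous: worker w gets one contiguous block of rows
  static List<List<Integer>> disjointContiguous(int numRows, int k) {
    List<List<Integer>> parts = new ArrayList<>();
    int blockSize = (int) Math.ceil((double) numRows / k);
    for (int w = 0; w < k; w++) {
      List<Integer> rows = new ArrayList<>();
      for (int i = w * blockSize; i < Math.min((w + 1) * blockSize, numRows); i++)
        rows.add(i);
      parts.add(rows);
    }
    return parts;
  }

  // disjoint_round_robin: row i goes to worker i % k
  static List<List<Integer>> disjointRoundRobin(int numRows, int k) {
    List<List<Integer>> parts = new ArrayList<>();
    for (int w = 0; w < k; w++) parts.add(new ArrayList<>());
    for (int i = 0; i < numRows; i++) parts.get(i % k).add(i);
    return parts;
  }
}
{code}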
[jira] [Created] (SYSTEMML-2317) Implementation of language extension
LI Guobao created SYSTEMML-2317: --- Summary: Implementation of language extension Key: SYSTEMML-2317 URL: https://issues.apache.org/jira/browse/SYSTEMML-2317 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to extend the parsing and validation at the language level. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2318) Hops, lops, instruction generation
LI Guobao created SYSTEMML-2318: --- Summary: Hops, lops, instruction generation Key: SYSTEMML-2318 URL: https://issues.apache.org/jira/browse/SYSTEMML-2318 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to implement the extension of hops, lops, and instruction generation for the new paramserv function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2319) IPA integration
LI Guobao created SYSTEMML-2319: --- Summary: IPA integration Key: SYSTEMML-2319 URL: https://issues.apache.org/jira/browse/SYSTEMML-2319 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to extend the IPA to avoid removing the referenced functions, given that the paramserv function is a second-order function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2320) Parfor integration
LI Guobao created SYSTEMML-2320: --- Summary: Parfor integration Key: SYSTEMML-2320 URL: https://issues.apache.org/jira/browse/SYSTEMML-2320 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to guarantee robustness for the case where the paramserv function is used inside a parfor statement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Issue Type: Technical task (was: Sub-task) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Technical task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to add language support for the “paramserv” function in order to be able to compile this new function. Since SystemML already supports parameterized built-in functions, we can easily add an additional operation type and generate a new instruction for the “paramserv” function. Recently, we have also added a new “eval” built-in function which is capable of passing a function pointer as an argument so that it can be called at runtime. Similarly, we would need to extend the inter-procedural analysis to avoid removing unused constructed functions in the presence of the second-order “paramserv” function, because the referenced functions, i.e., the aggregate and update functions, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Issue Type: Technical task (was: Sub-task) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Technical task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > Attachments: ps.png > > A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. > Push/Pull service: > In general, we could launch a parameter server inside (local multi-threaded backend) or outside (Spark distributed backend) of CP to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented over TCP. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. > Synchronization: > We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. (was: This part aims to implement the BSP strategy for the Spark distributed backend. The idea is to be able to launch a remote parameter server and the workers.) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via Netty RPC. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. (was: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap.) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via Netty RPC. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
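For illustration, a simplified worker-side exchange might look as follows; plain TCP sockets and Java serialization stand in for the proposed Netty RPC transport, a matching server-side handler is assumed, and all names are illustrative:
{code:java}
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.Socket;

// Worker-side sketch: push gradients, pull updated parameters. Plain TCP
// and Java serialization stand in for the proposed Netty RPC transport.
public class WorkerClientSketch {
  public static double[] pushAndPull(String serverHost, int serverPort,
                                     int workerId, double[] gradients) throws Exception {
    try (Socket socket = new Socket(serverHost, serverPort);
         ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
         ObjectInputStream in = new ObjectInputStream(socket.getInputStream())) {
      out.writeInt(workerId);     // identify the worker
      out.writeObject(gradients); // push: send the local gradients
      out.flush();
      return (double[]) in.readObject(); // pull: receive new global parameters
    }
  }
}
{code}
The broadcast of the server's IP address and port mentioned above is what makes this call possible on each worker.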
[jira] [Created] (SYSTEMML-2321) Aggregation service
LI Guobao created SYSTEMML-2321: --- Summary: Aggregation service Key: SYSTEMML-2321 URL: https://issues.apache.org/jira/browse/SYSTEMML-2321 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao The aggregation service is independent of local and remote workers. It is responsible for executing the parameter updates. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2322) Local workers
LI Guobao created SYSTEMML-2322: --- Summary: Local workers Key: SYSTEMML-2322 URL: https://issues.apache.org/jira/browse/SYSTEMML-2322 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to implement the local workers. It also covers data management, such as data distribution and program separation via function replication. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2323) Checkpointing
LI Guobao created SYSTEMML-2323: --- Summary: Checkpointing Key: SYSTEMML-2323 URL: https://issues.apache.org/jira/browse/SYSTEMML-2323 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to add the auxilary checkpointing service. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2323) Checkpointing
[ https://issues.apache.org/jira/browse/SYSTEMML-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2323: Description: It aims to add the auxiliary checkpointing service. (was: It aims to add the auxilary checkpointing service.) > Checkpointing > - > > Key: SYSTEMML-2323 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2323 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to add the auxiliary checkpointing service. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. Synchronization: We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. Push/Pull service: In general, we could launch a parameter server inside (local multi-threaded backend) or outside (Spark distributed backend) of CP to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented over TCP. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. Synchronization: We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Technical task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > Attachments: ps.png > > A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. > Synchronization: > We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085:
Description:
A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
A diagram of the parameter server architecture is shown below.

was:
A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
Synchronization: We also need to implement the synchronization between the workers and the parameter server in order to support more parameter-update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain, on the server, a vector clock recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request asking the server to compute its staleness from the vector clock. When the server receives gradients from a given worker, it increments that worker's entry in the vector clock. We can then define BSP as "staleness == 0", ASP as "staleness == -1", and SSP as "staleness == N".
A diagram of the parameter server architecture is shown below.

> Single-node parameter server primitives
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Technical task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>
> A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
> A diagram of the parameter server architecture is shown below.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2324) Synchronization
LI Guobao created SYSTEMML-2324:
---
Summary: Synchronization
Key: SYSTEMML-2324
URL: https://issues.apache.org/jira/browse/SYSTEMML-2324
Project: SystemML
Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao

We also need to implement the synchronization between the workers and the parameter server in order to support more parameter-update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain, on the server, a vector clock recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request asking the server to compute its staleness from the vector clock. When the server receives gradients from a given worker, it increments that worker's entry in the vector clock. We can then define BSP as "staleness == 0", ASP as "staleness == -1", and SSP as "staleness == N".
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
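To make the staleness semantics above concrete, here is a minimal sketch of a server-side vector clock; the class and method names ({{VectorClock}}, {{onGradientsReceived}}, {{awaitPermission}}) are invented for illustration and do not correspond to actual SystemML code:
{code:java}
// Hypothetical server-side vector clock implementing the staleness check;
// staleness == 0 gives BSP, -1 gives ASP, and N gives SSP.
public class VectorClock {
    private final int[] clocks;   // one logical clock per worker
    private final int staleness;  // 0 = BSP, -1 = ASP, N = SSP

    public VectorClock(int numWorkers, int staleness) {
        this.clocks = new int[numWorkers];
        this.staleness = staleness;
    }

    // Called when the server receives gradients from a worker:
    // increment that worker's entry and wake up any waiting workers.
    public synchronized void onGradientsReceived(int workerId) {
        clocks[workerId]++;
        notifyAll();
    }

    // Called by a worker after each iteration: block until this worker
    // is at most 'staleness' iterations ahead of the slowest worker.
    public synchronized void awaitPermission(int workerId) throws InterruptedException {
        if (staleness < 0)
            return; // ASP: fully asynchronous, never wait
        while (clocks[workerId] - minClock() > staleness)
            wait(); // BSP/SSP: too far ahead of the slowest worker
    }

    private int minClock() {
        int m = clocks[0];
        for (int c : clocks)
            m = Math.min(m, c);
        return m;
    }
}
{code}
Under this sketch, "staleness == 0" forces all workers to advance in lockstep (BSP), "staleness == -1" makes {{awaitPermission}} return immediately (ASP), and "staleness == N" lets a fast worker run at most N iterations ahead of the slowest one (SSP).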
[jira] [Updated] (SYSTEMML-2085) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085:
Description: A single-node parameter server acts as a data-parallel parameter server. A diagram of the parameter server architecture is shown below. (was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. A diagram of the parameter server architecture is shown below.)
Summary: Initial version of local backend (was: Single-node parameter server primitives)

> Initial version of local backend
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Technical task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>
> A single-node parameter server acts as a data-parallel parameter server. A diagram of the parameter server architecture is shown below.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)