[jira] [Commented] (SYSTEMML-2083) Language and runtime for parameter servers
[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368702#comment-16368702 ] LI Guobao commented on SYSTEMML-2083: - Hello, I'm LI Guobao, a second-year master's student in computer science at South China University of Technology. I also hold a master's degree in computer science from Polytech Nantes in France. I'm currently a contributor to the open-source project Alien4Cloud and have some experience with distributed computing frameworks. I'm extremely interested in contributing to this project. Thank you in advance. > Language and runtime for parameter servers > -- > > Key: SYSTEMML-2083 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2083 > Project: SystemML > Issue Type: Epic >Reporter: Matthias Boehm >Priority: Major > Labels: gsoc2018 > Attachments: image-2018-02-14-12-18-48-932.png, > image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png > > > SystemML already provides a rich set of execution strategies ranging from > local operations to large-scale computation on MapReduce or Spark. In this > context, we support both data-parallel (multi-threaded or distributed > operations) as well as task-parallel computation (multi-threaded or > distributed parfor loops). This epic aims to complement the existing > execution strategies by language and runtime primitives for parameter > servers, i.e., model-parallel execution. We use the terminology of > model-parallel execution with distributed data and distributed model to > differentiate them from the existing data-parallel operations. Target > applications are distributed deep learning and mini-batch algorithms in > general. These new abstractions will help make SystemML a unified framework > for small- and large-scale machine learning that supports all three major > execution strategies in a single framework. > > A major challenge is the integration of stateful parameter servers and their > common push/pull primitives into an otherwise functional (and thus > stateless) language. We will approach this challenge via a new builtin > function {{paramserv}} which internally maintains state but at the same time > fits into the runtime framework of stateless operations. > Furthermore, we are interested in providing (1) different runtime backends > (local and distributed), (2) different parameter server modes (synchronous, > asynchronous, hogwild!, stale-synchronous), (3) different update frequencies > (batch, multi-batch, epoch), as well as (4) different architectures for > distributed data (1 parameter server, k workers) and distributed model (k1 > parameter servers, k2 workers). > > *Note for GSOC students:* This is a large project which will be broken down > into sub-projects, so everybody will have their share of the pie. > *Prerequisites:* Java; machine learning experience is a plus but not required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2083) Language and runtime for parameter servers
[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370151#comment-16370151 ] LI Guobao commented on SYSTEMML-2083: - Hi, I have run some samples of SystemML locally on my PC and looked through the papers above about SystemML as well as parameter servers. I now have an overview of the three strategies for updating the parameters. I'm motivated to learn more about the runtime of built-in functions and the off-the-shelf solutions for parameter servers, so I'd like to know if I can join the discussion group for this subject. Thank you in advance for the response, Guobao > Language and runtime for parameter servers > -- > > Key: SYSTEMML-2083 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2083 > Project: SystemML > Issue Type: Epic >Reporter: Matthias Boehm >Priority: Major > Labels: gsoc2018 > Attachments: image-2018-02-14-12-18-48-932.png, > image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png > > > SystemML already provides a rich set of execution strategies ranging from > local operations to large-scale computation on MapReduce or Spark. In this > context, we support both data-parallel (multi-threaded or distributed > operations) as well as task-parallel computation (multi-threaded or > distributed parfor loops). This epic aims to complement the existing > execution strategies by language and runtime primitives for parameter > servers, i.e., model-parallel execution. We use the terminology of > model-parallel execution with distributed data and distributed model to > differentiate them from the existing data-parallel operations. Target > applications are distributed deep learning and mini-batch algorithms in > general. These new abstractions will help make SystemML a unified framework > for small- and large-scale machine learning that supports all three major > execution strategies in a single framework. > > A major challenge is the integration of stateful parameter servers and their > common push/pull primitives into an otherwise functional (and thus > stateless) language. We will approach this challenge via a new builtin > function {{paramserv}} which internally maintains state but at the same time > fits into the runtime framework of stateless operations. > Furthermore, we are interested in providing (1) different runtime backends > (local and distributed), (2) different parameter server modes (synchronous, > asynchronous, hogwild!, stale-synchronous), (3) different update frequencies > (batch, multi-batch, epoch), as well as (4) different architectures for > distributed data (1 parameter server, k workers) and distributed model (k1 > parameter servers, k2 workers). > > *Note for GSOC students:* This is a large project which will be broken down > into sub-projects, so everybody will have their share of the pie. > *Prerequisites:* Java; machine learning experience is a plus but not required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2077) New eval builtin function
[ https://issues.apache.org/jira/browse/SYSTEMML-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372920#comment-16372920 ] LI Guobao commented on SYSTEMML-2077: - Hello, can you give me some information about how the "eval" function works in a real case? I'd like to take this ticket. Regards, Guobao > New eval builtin function > - > > Key: SYSTEMML-2077 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2077 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > This new eval builtin function aims to provide a concise language construct > to evaluate dynamic expressions and functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2077) New eval builtin function
[ https://issues.apache.org/jira/browse/SYSTEMML-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374310#comment-16374310 ] LI Guobao commented on SYSTEMML-2077: - Hi [~mboehm7], I have a problem compiling the code. I wanted to launch a test in the class MLContextTest, but I got compilation errors because of missing classes such as PydmlParser, DmlParser, etc. I saw that they are excluded in the .gitignore file. So how can I compile the code to launch a test? Regards, Guobao > New eval builtin function > - > > Key: SYSTEMML-2077 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2077 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > This new eval builtin function aims to provide a concise language construct > to evaluate dynamic expressions and functions. > Similar to R's eval function > (https://stat.ethz.ch/R-manual/R-devel/library/base/html/eval.html), this > would allow us to evaluate dynamically constructed expressions. There are two > major sub-tasks here: the invocation of given function pointers and the > evaluation of dynamic expressions given as strings. Initially, we would focus > on the former by allowing calls such as {{R = eval(fname, A, B, C)}}. So far > SystemML does not provide second-order functions, which requires explicit > {{if-else}} conditions for ensemble learning workloads. With this new > {{eval}} function we could store a list of function names in a frame {{F}} > and dynamically call them via {{R[i, ] = eval(F[i,1], A, B, C)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
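To illustrate the second-order dispatch behind {{eval}}, here is a minimal Java sketch under assumed names (a concept illustration only, not SystemML's actual DML implementation): functions are registered in a map keyed by name and looked up at runtime, which is exactly what explicit {{if-else}} chains would otherwise have to hard-code per function.
{code:java}
import java.util.Arrays;
import java.util.Map;
import java.util.function.BinaryOperator;

// Concept sketch (hypothetical names): dispatch a function by its name at
// runtime, as R = eval(fname, A, B, C) does for DML functions.
public class EvalSketch {
    // registry of named functions (stand-ins for DML ensemble members)
    static final Map<String, BinaryOperator<double[]>> FUNCTIONS = Map.of(
        "add", (a, b) -> { double[] r = a.clone(); for (int i = 0; i < r.length; i++) r[i] += b[i]; return r; },
        "mul", (a, b) -> { double[] r = a.clone(); for (int i = 0; i < r.length; i++) r[i] *= b[i]; return r; });

    static double[] eval(String fname, double[] a, double[] b) {
        BinaryOperator<double[]> f = FUNCTIONS.get(fname);
        if (f == null)
            throw new IllegalArgumentException("unknown function: " + fname);
        return f.apply(a, b); // second-order call: the function is a value
    }

    public static void main(String[] args) {
        // iterate over function names, as in R[i, ] = eval(F[i,1], A, B, C)
        for (String fname : new String[]{"add", "mul"})
            System.out.println(fname + ": " + Arrays.toString(
                eval(fname, new double[]{1, 2}, new double[]{3, 4})));
    }
}
{code}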
[jira] [Commented] (SYSTEMML-2078) Support for global constants
[ https://issues.apache.org/jira/browse/SYSTEMML-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391518#comment-16391518 ] LI Guobao commented on SYSTEMML-2078: - I have an idea about this. Could we add a built-in function to assign a value to a global constant variable, like "global(size, 10)"? > Support for global constants > > > Key: SYSTEMML-2078 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2078 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > Similar to R, where variables of the surrounding context are accessible, this > task aims to introduce global constant variables. Furthermore, we should also > add builtin constants such as NaN, INF, and PI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SYSTEMML-2078) Support for global constants
[ https://issues.apache.org/jira/browse/SYSTEMML-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391518#comment-16391518 ] LI Guobao edited comment on SYSTEMML-2078 at 3/8/18 9:50 PM: - [~mboehm7] I have an idea about this. Could we add a built-in function to assign a value to a global constant variable, like "global(size, 10)"? was (Author: guobao): I have an idea about this. Could we add a built-in function to assign a value to a global constant variable, like "global(size, 10)"? > Support for global constants > > > Key: SYSTEMML-2078 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2078 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > Similar to R, where variables of the surrounding context are accessible, this > task aims to introduce global constant variables. Furthermore, we should also > add builtin constants such as NaN, INF, and PI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2078) Support for global constants
[ https://issues.apache.org/jira/browse/SYSTEMML-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394613#comment-16394613 ] LI Guobao commented on SYSTEMML-2078: - OK, thanks [~mboehm7]. I will try to work on this issue following your hints. > Support for global constants > > > Key: SYSTEMML-2078 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2078 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > > Similar to R, where variables of the surrounding context are accessible, this > task aims to introduce global constant variables. Furthermore, we should also > add builtin constants such as NaN, INF, and PI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418050#comment-16418050 ] LI Guobao commented on SYSTEMML-2197: - Hi [~mboehm7], could you give me some more details on this issue? Thanks > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419113#comment-16419113 ] LI Guobao commented on SYSTEMML-2197: - Thanks [~mboehm7] for the details. Which test should I run for it? Thanks. > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/data/PartitionedBlock.java#L82) > as well as the subsequent partition creation and actual broadcasting > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L548). > > For consistency and in order to avoid excessive over-provisioning, this > multi-threading should use the common internal thread pool or parallel Java > streams, which similarly call the shared {{ForkJoinPool.commonPool}}. An > example is the multi-threaded parallelization of RDDs, which similarly blocks > a given matrix into its squared blocks (see > https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L679). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2197: --- Assignee: LI Guobao > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/data/PartitionedBlock.java#L82) > as well as the subsequent partition creation and actual broadcasting > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L548). > > For consistency and in order to avoid excessive over-provisioning, this > multi-threading should use the common internal thread pool or parallel Java > streams, which similarly call the shared {{ForkJoinPool.commonPool}}. An > example is the multi-threaded parallelization of RDDs, which similarly blocks > a given matrix into its squared blocks (see > https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L679). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
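As a rough illustration of the requested change, the following Java fragment is a minimal sketch (illustrative names and data layout, not SystemML's actual {{PartitionedBlock}} code) of parallelizing the blocking loop with parallel Java streams, which execute on the shared {{ForkJoinPool.commonPool}} as required above.
{code:java}
import java.util.stream.IntStream;

// Sketch: blockify a dense row-major matrix into blen x blen blocks in
// parallel; writes go to disjoint slots of the output array, so no locking.
public class ParallelBlockify {
    public static double[][][] blockify(double[] A, int rows, int cols, int blen) {
        int nrb = (rows + blen - 1) / blen; // number of row blocks
        int ncb = (cols + blen - 1) / blen; // number of column blocks
        double[][][] blocks = new double[nrb * ncb][][];
        // one independent task per block, scheduled on ForkJoinPool.commonPool
        IntStream.range(0, nrb * ncb).parallel().forEach(ix -> {
            int bi = ix / ncb, bj = ix % ncb;
            int rlen = Math.min(blen, rows - bi * blen); // partial last blocks
            int clen = Math.min(blen, cols - bj * blen);
            double[][] blk = new double[rlen][clen];
            for (int i = 0; i < rlen; i++)
                for (int j = 0; j < clen; j++)
                    blk[i][j] = A[(bi * blen + i) * cols + (bj * blen + j)];
            blocks[ix] = blk;
        });
        return blocks;
    }
}
{code}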
[jira] [Assigned] (SYSTEMML-2077) New eval builtin function
[ https://issues.apache.org/jira/browse/SYSTEMML-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2077: --- Assignee: LI Guobao > New eval builtin function > - > > Key: SYSTEMML-2077 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2077 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This new eval builtin function aims to provide a concise language construct > to evaluate dynamic expressions and functions. > Similar to R's eval function > (https://stat.ethz.ch/R-manual/R-devel/library/base/html/eval.html), this > would allow us to evaluate dynamically constructed expressions. There are two > major sub-tasks here: the invocation of given function pointers and the > evaluation of dynamic expressions given as strings. Initially, we would focus > on the former by allowing calls such as {{R = eval(fname, A, B, C)}}. So far > SystemML does not provide second-order functions, which requires explicit > {{if-else}} conditions for ensemble learning workloads. With this new > {{eval}} function we could store a list of function names in a frame {{F}} > and dynamically call them via {{R[i, ] = eval(F[i,1], A, B, C)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2232) Logical namespace handling user-defined functions
[ https://issues.apache.org/jira/browse/SYSTEMML-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426205#comment-16426205 ] LI Guobao commented on SYSTEMML-2232: - Hi [~mboehm7], as I understand it, we will keep the logical name instead of converting it to the real filename? > Logical namespace handling user-defined functions > - > > Key: SYSTEMML-2232 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2232 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Priority: Major > > At script level, functions might have logical namespace names such as > {{foo::bar()}}, where foo is the namespace name, and bar is the function > name. To handle namespace conflicts, SYSTEMML-631 internally replaced the > logical namespaces with filenames. For reasons such as improved statistics > output and the handling of namespace functions in the recently introduced > {{eval}} function (SYSTEMML-2077), it would be good to keep the logical > namespace as well. > This task aims to (1) extend the {{FunctionStatementBlock}} and > {{FunctionProgramBlock}} data structures to keep the logical namespace name, > (2) extend the parser and compiler accordingly, and (3) modify the statistics > maintenance to use the function key (i.e., concatenation of logical namespace > and function name) as the opcode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426745#comment-16426745 ] LI Guobao commented on SYSTEMML-2197: - OK. I got an error when launching this test. [~mboehm7], could you help me out with this? I have set the classpath to the systemml module, and the generated folders inside target can also be found. {code:java} 18/04/05 12:40:42 INFO api.DMLScript: END DML run 04/05/2018 12:40:42 starting R script cmd: Rscript --default-packages=methods,datasets,graphics,grDevices,stats,utils ./src/test/scripts/functions/binary/matrix_full_other/FullDistributedMatrixMultiplication.R target/testTemp/functions/binary/matrix_full_other/FullDistributedMatrixMultiplicationTest/in/ target/testTemp/functions/binary/matrix_full_other/FullDistributedMatrixMultiplicationTest/expected/0.7_0.1/ java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at java.lang.Runtime.exec(Runtime.java:620) at java.lang.Runtime.exec(Runtime.java:450) at java.lang.Runtime.exec(Runtime.java:347) at org.apache.sysml.test.integration.AutomatedTestBase.runRScript(AutomatedTestBase.java:990) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.runDistributedMatrixMatrixMultiplicationTest(FullDistributedMatrixMultiplicationTest.java:277) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.testDenseSparseRmmSpark(FullDistributedMatrixMultiplicationTest.java:209) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.junit.runner.JUnitCore.run(JUnitCore.java:160) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47) at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) Caused by: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:247) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 32 more {code} > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https:
[jira] [Commented] (SYSTEMML-2197) Multi-threaded broadcast creation
[ https://issues.apache.org/jira/browse/SYSTEMML-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428709#comment-16428709 ] LI Guobao commented on SYSTEMML-2197: - Thanks, I successfully launched this test but got 4 failed tests (testDenseDenseMapmmMR, testDenseSparseMapmmMR, testSparseDenseMapmmMR, testSparseSparseMapmmMR), which seem to concern the MR backend. When switching back to the master branch, I got the same result. So is this a known bug? Can I ignore it? Thanks for the response. Here is the stack trace: {code:java} 18/04/06 19:38:48 ERROR api.DMLScript: Failed to execute DML script. org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 23 and 27 -- Error evaluating instruction: jobtype = GMR input labels = [_mVar13, _mVar14] recReader inst = rand inst = mapper inst = MR°mapmm°0·MATRIX·DOUBLE°1·MATRIX·DOUBLE°2·MATRIX·DOUBLE°RIGHT°false shuffle inst = agg inst = MR°ak+°2·MATRIX·DOUBLE°3·MATRIX·DOUBLE°true°NONE other inst = output labels = [pVar15] result indices = ,3 num reducers = 10 replication = 1 at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123) at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:97) at org.apache.sysml.api.DMLScript.execute(DMLScript.java:744) at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:515) at org.apache.sysml.api.DMLScript.main(DMLScript.java:246) at org.apache.sysml.test.integration.AutomatedTestBase.runTest(AutomatedTestBase.java:1214) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.runDistributedMatrixMatrixMultiplicationTest(FullDistributedMatrixMultiplicationTest.java:276) at org.apache.sysml.test.integration.functions.binary.matrix_full_other.FullDistributedMatrixMultiplicationTest.testSparseSparseMapmmMR(FullDistributedMatrixMultiplicationTest.java:101){code} > Multi-threaded broadcast creation > - > > Key: SYSTEMML-2197 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2197 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > All Spark instructions that broadcast one of the input operands rely on a > shared primitive {{sec.getBroadcastForVariable(var)}} for creating > partitioned broadcasts, which are wrapper objects around potentially many > broadcast variables to overcome Spark's 2 GB limitation for compressed > broadcasts. Each individual broadcast blocks the matrix into squared blocks > for direct access without unnecessary copies per task. So far this broadcast > creation is single-threaded. > This task aims to parallelize the blocking of the given in-memory matrix into > squared blocks > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/data/PartitionedBlock.java#L82) > as well as the subsequent partition creation and actual broadcasting > (https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L548). > > For consistency and in order to avoid excessive over-provisioning, this > multi-threading should use the common internal thread pool or parallel Java > streams, which similarly call the shared {{ForkJoinPool.commonPool}}. 
An > example is the multi-threaded parallelization of RDDs, which similarly blocks > a given matrix into its squared blocks (see > https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/controlprogram/context/SparkExecutionContext.java#L679). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431179#comment-16431179 ] LI Guobao commented on SYSTEMML-1313: - Hi [~mboehm7], could you give me more details about this issue? > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-1313: --- Assignee: LI Guobao > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > The parfor optimizer may decide to execute the entire loop as a remote Spark > job to utilize cluster parallelism. In this case all inputs to the parfor > body (i.e., variables that are created or read outside of the parfor body but > used or overwritten inside) are read from HDFS. In the past there was an > issue of redundant reads, which has been addressed with SYSTEMML-1879. > However, the direct use of Spark broadcast variables would likely improve > performance, especially in clusters with many nodes. > This task aims to leverage Spark broadcast variables for all parfor inputs. > In detail, this entails two major aspects. First, we need runtime support to > optionally broadcast the inputs via broadcast variables in > {{RemoteParForSpark}} and obtain them from these broadcast variables in > {{RemoteParForSparkWorker}} without causing unnecessary eviction. In > contrast to the existing broadcast primitives, we don't need to blockify the > matrix because the matrix is accessed in full by in-memory operations. > Second, this requires an extension of the parfor optimizer to reason about > scenarios where it is safe to use broadcasts, because these broadcasts cause > additional memory requirements since they act as pinned in-memory matrices. > This second task likely overlaps with SYSTEMML-1349, which requires a > similar reasoning to handle shared reads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
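A minimal Java sketch of the first aspect follows, under assumed names (the actual {{RemoteParForSpark}}/{{RemoteParForSparkWorker}} wiring and SystemML's matrix types are omitted): the driver broadcasts each input once, and remote tasks obtain it from the broadcast variable instead of re-reading it from HDFS per task.
{code:java}
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Sketch only (hypothetical types): ship a parfor input to all workers
// as a Spark broadcast variable.
public class ParForBroadcastSketch {
    public static long run(JavaSparkContext sc, double[][] input, List<Integer> taskIds) {
        // broadcast the full, non-blockified matrix once per job;
        // parfor bodies access it as a whole via in-memory operations
        Broadcast<double[][]> bInput = sc.broadcast(input);
        JavaRDD<Integer> tasks = sc.parallelize(taskIds);
        // each task pulls the matrix from the broadcast, not from HDFS
        return tasks.filter(id -> processTask(id, bInput.value())).count();
    }

    // placeholder for the actual parfor body executed per iteration
    private static boolean processTask(int id, double[][] X) {
        return X.length > 0;
    }
}
{code}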
[jira] [Commented] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445646#comment-16445646 ] LI Guobao commented on SYSTEMML-1313: - Hi [~mboehm7], I have introduced the runtime support and I'd like to run a test for it. Which test should I take? Should I install a YARN cluster so that the parfor can be executed in `remote_spark` mode? > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > The parfor optimizer may decide to execute the entire loop as a remote Spark > job to utilize cluster parallelism. In this case all inputs to the parfor > body (i.e., variables that are created or read outside of the parfor body but > used or overwritten inside) are read from HDFS. In the past there was an > issue of redundant reads, which has been addressed with SYSTEMML-1879. > However, the direct use of Spark broadcast variables would likely improve > performance, especially in clusters with many nodes. > This task aims to leverage Spark broadcast variables for all parfor inputs. > In detail, this entails two major aspects. First, we need runtime support to > optionally broadcast the inputs via broadcast variables in > {{RemoteParForSpark}} and obtain them from these broadcast variables in > {{RemoteParForSparkWorker}} without causing unnecessary eviction. In > contrast to the existing broadcast primitives, we don't need to blockify the > matrix because the matrix is accessed in full by in-memory operations. > Second, this requires an extension of the parfor optimizer to reason about > scenarios where it is safe to use broadcasts, because these broadcasts cause > additional memory requirements since they act as pinned in-memory matrices. > This second task likely overlaps with SYSTEMML-1349, which requires a > similar reasoning to handle shared reads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (SYSTEMML-1313) Parfor broadcast exploitation
[ https://issues.apache.org/jira/browse/SYSTEMML-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-1313: Comment: was deleted (was: Hi [~mboehm7], I have introduced the runtime support and I'd like to run a test for it. Which test should I take? Should I install a YARN cluster so that the parfor can be executed in `remote_spark` mode?) > Parfor broadcast exploitation > - > > Key: SYSTEMML-1313 > URL: https://issues.apache.org/jira/browse/SYSTEMML-1313 > Project: SystemML > Issue Type: Sub-task > Components: APIs, Runtime >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > The parfor optimizer may decide to execute the entire loop as a remote Spark > job to utilize cluster parallelism. In this case all inputs to the parfor > body (i.e., variables that are created or read outside of the parfor body but > used or overwritten inside) are read from HDFS. In the past there was an > issue of redundant reads, which has been addressed with SYSTEMML-1879. > However, the direct use of Spark broadcast variables would likely improve > performance, especially in clusters with many nodes. > This task aims to leverage Spark broadcast variables for all parfor inputs. > In detail, this entails two major aspects. First, we need runtime support to > optionally broadcast the inputs via broadcast variables in > {{RemoteParForSpark}} and obtain them from these broadcast variables in > {{RemoteParForSparkWorker}} without causing unnecessary eviction. In > contrast to the existing broadcast primitives, we don't need to blockify the > matrix because the matrix is accessed in full by in-memory operations. > Second, this requires an extension of the parfor optimizer to reason about > scenarios where it is safe to use broadcasts, because these broadcasts cause > additional memory requirements since they act as pinned in-memory matrices. > This second task likely overlaps with SYSTEMML-1349, which requires a > similar reasoning to handle shared reads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2298) Create a test dml script based on NN library
LI Guobao created SYSTEMML-2298: --- Summary: Create a test dml script based on NN library Key: SYSTEMML-2298 URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2298) Create a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2298: --- Assignee: LI Guobao > Create a test dml script based on NN library > > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Summary: Creation of a test dml script based on NN library (was: Create a test dml script based on NN library) > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared, and a test dml script that leverages the new "paramserve" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared. > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be > fully prepared, and a test dml script that leverages the new "paramserve" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared, and a test dml script that leverages the new "paramserv" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared. (was: During the community bonding period, the development environment should be fully prepared, and a test dml script that leverages the new "paramserve" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared.) > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be > fully prepared, and a test dml script that leverages the new "paramserv" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2299) API design of the paramserv function
LI Guobao created SYSTEMML-2299: --- Summary: API design of the paramserv function Key: SYSTEMML-2299 URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Due Date: 28/May/18 > Language and compiler extension > --- > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Creation of a test dml script based on NN library
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Due Date: 14/May/18 > Creation of a test dml script based on NN library > - > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be > fully prepared, and a test dml script that leverages the new "paramserv" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Summary: Implementation of language and compiler extension (was: Language and compiler extension) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2084: --- Assignee: LI Guobao > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2299: --- Assignee: LI Guobao > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Due Date: 4/Jun/18 Summary: Single-node parameter server primitives (was: Basic runtime primitives) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2085: --- Assignee: LI Guobao > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2300) First evaluation
LI Guobao created SYSTEMML-2300: --- Summary: First evaluation Key: SYSTEMML-2300 URL: https://issues.apache.org/jira/browse/SYSTEMML-2300 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2086: --- Assignee: LI Guobao > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Summary: Initial version of local backend (was: Local, multi-threaded backend) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Due Date: 25/Jun/18 > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2087: --- Assignee: LI Guobao Due Date: 9/Jul/18 Summary: Initial version of distributed spark backend (was: Distributed spark backend) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2301) Second evaluation
LI Guobao created SYSTEMML-2301: --- Summary: Second evaluation Key: SYSTEMML-2301 URL: https://issues.apache.org/jira/browse/SYSTEMML-2301 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2302) Second version of execution backend
LI Guobao created SYSTEMML-2302: --- Summary: Second version of execution backend Key: SYSTEMML-2302 URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2303) Integration test, implementation of samples and documentation
LI Guobao created SYSTEMML-2303: --- Summary: Integration test, implementation of samples and documentation Key: SYSTEMML-2303 URL: https://issues.apache.org/jira/browse/SYSTEMML-2303 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2304) Submit final product
LI Guobao created SYSTEMML-2304: --- Summary: Submit final product Key: SYSTEMML-2304 URL: https://issues.apache.org/jira/browse/SYSTEMML-2304 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2083) Language and runtime for parameter servers
[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2083: --- Assignee: LI Guobao > Language and runtime for parameter servers > -- > > Key: SYSTEMML-2083 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2083 > Project: SystemML > Issue Type: Epic >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Labels: gsoc2018 > Attachments: image-2018-02-14-12-18-48-932.png, > image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png > > > SystemML already provides a rich set of execution strategies ranging from > local operations to large-scale computation on MapReduce or Spark. In this > context, we support both data-parallel (multi-threaded or distributed > operations) as well as task-parallel computation (multi-threaded or > distributed parfor loops). This epic aims to complement the existing > execution strategies by language and runtime primitives for parameter > servers, i.e., model-parallel execution. We use the terminology of > model-parallel execution with distributed data and distributed model to > differentiate them from the existing data-parallel operations. Target > applications are distributed deep learning and mini-batch algorithms in > general. These new abstractions will help make SystemML a unified framework > for small- and large-scale machine learning that supports all three major > execution strategies in a single framework. > > A major challenge is the integration of stateful parameter servers and their > common push/pull primitives into an otherwise functional (and thus > stateless) language. We will approach this challenge via a new builtin > function {{paramserv}} which internally maintains state but at the same time > fits into the runtime framework of stateless operations. > Furthermore, we are interested in providing (1) different runtime backends > (local and distributed), (2) different parameter server modes (synchronous, > asynchronous, hogwild!, stale-synchronous), (3) different update frequencies > (batch, multi-batch, epoch), as well as (4) different architectures for > distributed data (1 parameter server, k workers) and distributed model (k1 > parameter servers, k2 workers). > > *Note for GSOC students:* This is a large project which will be broken down > into sub-projects, so everybody will have their share of the pie. > *Prerequisites:* Java; machine learning experience is a plus but not required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature is > illustrated in Figure 1. We are interested in providing the model, the > training features and labels, the validation features and labels, the batch > update function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or batch), the > aggregation function, the number of epochs, the batch size, the degree of > parallelism, as well as the checkpointing strategy (e.g. rollback recovery). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery).) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature is > illustrated in Figure 1. We are interested in providing the model, the > training features and labels, the validation features and labels, the > gradient calculation function, the batch update function, the update strategy > (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. > epoch or batch), the aggregation function, the number of epochs, the batch > size, the degree of parallelism, as well as the checkpointing strategy (e.g. > rollback recovery). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature is illustrated in Figure 1. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery).) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, > freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, > checkpointing=rollback)_. We are interested in providing the model, the > training features and labels, the validation features and labels, the > gradient calculation function, the batch update function, the update strategy > (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. > epoch or batch), the aggregation function, the number of epochs, the batch > size, the degree of parallelism, as well as the checkpointing strategy (e.g. > rollback recovery). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Description: This part aims to add language support for the “paramserv” function in order to be able to compile this new function. Since SystemML already supports parameterized builtin functions, we can easily add an additional operation type and generate a new instruction for the “paramserv” function. Recently, we have also added a new “eval” built-in function which can take a function pointer as an argument so that it can be called at runtime. Similarly, we would need to extend the inter-procedural analysis to avoid removing unused constructed functions in the presence of the second-order “paramserv” function, because the referenced functions, i.e., the aggregate function and the update function, must be present at runtime. > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to add language support for the “paramserv” > function in order to be able to compile this new function. Since SystemML > already supports parameterized builtin functions, we can easily add an > additional operation type and generate a new instruction for the “paramserv” > function. Recently, we have also added a new “eval” built-in function which > can take a function pointer as an argument so that it can be called at > runtime. Similarly, we would need to extend the inter-procedural analysis > to avoid removing unused constructed functions in the presence of the > second-order “paramserv” function, because the referenced functions, i.e., > the aggregate function and the update function, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A parameter server persists the model parameters in a distributed manner. It is especially applied in the context of large-scale machine learning to train models. The parameter computation is performed with data parallelism across the workers. The data-parallel parameter server architecture is illustrated in Figure 2. Inspired by a lightweight parameter server interface [1], we provide the push and pull methods as internal primitives, i.e., not exposed at the script level, allowing the workers to exchange intermediates. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > A parameter server persists the model parameters in a distributed > manner. It is especially applied in the context of large-scale machine > learning to train models. The parameter computation is performed with > data parallelism across the workers. The data-parallel parameter server > architecture is illustrated in Figure 2. Inspired > by a lightweight parameter server interface [1], we provide > the push and pull methods as internal primitives, i.e., not exposed at the > script level, allowing the workers to exchange intermediates. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. (was: A parameter server persists the model parameters in a distributed manner. It is especially applied in the context of large-scale machine learning to train models. The parameter computation is performed with data parallelism across the workers. The data-parallel parameter server architecture is illustrated in Figure 2. Inspired by a lightweight parameter server interface [1], we provide the push and pull methods as internal primitives, i.e., not exposed at the script level, allowing the workers to exchange intermediates.) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), which stores each parameter as a value under a defined key. For example, inserting the global parameters under a key named “worker-param-replica” allows the workers to retrieve the parameter replica. Hence, in the local multi-threaded backend, the workers can communicate directly with this hashmap in the same process. In the Spark distributed backend, the CP first needs to fork a thread to start a parameter server which maintains the hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via TCP sockets. Since SystemML has good cache management, we only need to maintain in the hashmap a matrix reference pointing to a file location instead of the real data instance. If time permits, in order to introduce the async and staleness update strategies, we will implement the synchronization by leveraging vector clocks. (was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. The idea is to run a single-node parameter server by maintaining a > hashmap inside the CP (Control Program), which stores each parameter as a value > under a defined key. For example, inserting the global parameters > under a key named “worker-param-replica” allows the workers to retrieve the > parameter replica. Hence, in the local multi-threaded backend, > the workers can communicate directly with this hashmap in the same process. In > the Spark distributed backend, the CP first needs to fork a > thread to start a parameter server which maintains the hashmap; the workers > can then send intermediates and retrieve parameters by connecting to the > parameter server via TCP sockets. Since SystemML has good cache management, we > only need to maintain in the hashmap a matrix reference pointing to a file location > instead of the real data instance. If time permits, in order to > introduce the async and staleness update strategies, we will > implement the synchronization by leveraging vector clocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
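As a rough illustration of these internal primitives, the following minimal sketch (assumed names; not SystemML code) shows how a keyed store inside the CP could back push and pull. In SystemML the stored values would be matrix handles that, as noted above, may merely reference a cached file rather than hold the raw data.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal sketch (illustrative names, not SystemML code): the CP-internal
 *  key/value store behind the push/pull primitives. */
public class LocalParamServer {
    // values stand in for matrix handles, which may point to cached files
    private final Map<String, Object> store = new ConcurrentHashMap<>();

    // push: a worker publishes an intermediate (e.g. its gradients),
    // or the server publishes the new global parameters
    public void push(String key, Object value) {
        store.put(key, value);
    }

    // pull: retrieve the value stored under a key, e.g. the global
    // parameters under "worker-param-replica"
    public Object pull(String key) {
        return store.get(key);
    }
}
{code}
Blocking or consuming variants of {{pull}} (e.g. removing a gradient once aggregated) are deliberately omitted here; the point is only that a concurrent map suffices as the single-process backing store.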
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: ps.png > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. The idea is to run a single-node parameter server by maintaining a > hashmap inside the CP (Control Program), which stores each parameter as a value > under a defined key. For example, inserting the global parameters > under a key named “worker-param-replica” allows the workers to retrieve the > parameter replica. Hence, in the local multi-threaded backend, > the workers can communicate directly with this hashmap in the same process. In > the Spark distributed backend, the CP first needs to fork a > thread to start a parameter server which maintains the hashmap; the workers > can then send intermediates and retrieve parameters by connecting to the > parameter server via TCP sockets. Since SystemML has good cache management, we > only need to maintain in the hashmap a matrix reference pointing to a file location > instead of the real data instance. If time permits, in order to > introduce the async and staleness update strategies, we will > implement the synchronization by leveraging vector clocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). The function will return a trained model in format of struct. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery).) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), > the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch > size, the degree of parallelism, as well as the checkpointing strategy (e.g. > rollback recovery). The function will return a trained model in format of struct. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). The function will return a trained model in struct format. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g. rollback recovery). The function will return a trained model in format of struct.) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch size, the > degree of parallelism, as well as the checkpointing strategy (e.g. rollback > recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
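For reference, a hypothetical Java mirror of the configuration space named in this signature; SystemML exposes these as named dml arguments, not as a Java API, and all identifiers below are illustrative only.
{code:java}
/** Hypothetical mirror of the paramserv knobs, purely to enumerate them;
 *  defaults follow the example values in the signature above. */
public class ParamservConfig {
    enum Mode { SYNC, ASYNC, HOGWILD, STALE_SYNC }   // update strategy
    enum Frequency { BATCH, EPOCH }                  // update frequency
    enum Checkpointing { NONE, ROLLBACK }            // recovery strategy

    String updateFn;                 // upd=...  batch update function name
    String aggregateFn;              // agg=...  gradient aggregation function name
    Mode mode = Mode.SYNC;
    Frequency freq = Frequency.EPOCH;
    Checkpointing checkpointing = Checkpointing.ROLLBACK;
    int epochs = 100;
    int batchsize = 64;
    int k = 7;                       // degree of parallelism (worker count)
}
{code}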
[jira] [Updated] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Summary: Preparation of dev environment (was: Creation of a test dml script based on NN library) > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be fully > prepared. A test dml script which leverages the new "paramserv" function > to rewrite the training function in the [MNIST LeNet > Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] > could also be prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared. The native OpenBLAS library should be installed in order to run the MNIST LeNet example. Then, by leveraging the infinite MNIST data generator ([http://leon.bottou.org/projects/infimnist]), we can generate 256k instances to train the model. (was: During the community bonding period, the development environment should be fully prepared. A test dml script which leverages the new "paramserv" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could also be prepared.) > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > During the community bonding period, the development environment should be fully > prepared. The native OpenBLAS library should be installed in order to run the > MNIST LeNet example. Then, by leveraging the infinite MNIST data generator > ([http://leon.bottou.org/projects/infimnist]), we can generate 256k > instances to train the model. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2306) Implementation of a script with paramserv func
LI Guobao created SYSTEMML-2306: --- Summary: Implementation of a script with paramserv func Key: SYSTEMML-2306 URL: https://issues.apache.org/jira/browse/SYSTEMML-2306 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao This task aims to write a dml script using the paramserv function. We can easily reuse the MNIST LeNet example and adapt it by creating a struct-like model and passing the update function as well as the aggregation function. In this case, the update function, which will be executed in the workers, computes the gradients by running the forward and backward passes over a batch. The aggregation function, which will be run in the parameter server, updates the weights and biases by aggregating the received gradients. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
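As a toy stand-in for the two user functions such a script would pass, the sketch below shows a worker-side gradient function (forward plus backward pass over one batch) and a server-side aggregation function; a linear model replaces the LeNet layers purely for illustration, and all names are hypothetical.
{code:java}
import java.util.List;

/** Toy stand-in for the functions a paramserv script would pass
 *  (linear model instead of LeNet; illustrative only). */
public class ParamservFunctions {

    /** Worker side: gradient of squared loss for y ~ X * w on one batch. */
    static double[] gradient(double[] w, double[][] X, double[] y) {
        double[] g = new double[w.length];
        for (int i = 0; i < X.length; i++) {
            double pred = 0;                                   // forward pass
            for (int j = 0; j < w.length; j++) pred += X[i][j] * w[j];
            double err = pred - y[i];                          // backward pass
            for (int j = 0; j < w.length; j++)
                g[j] += 2 * err * X[i][j] / X.length;
        }
        return g;
    }

    /** Server side: average the received gradients and take one SGD step. */
    static double[] aggregate(double[] w, List<double[]> grads, double lr) {
        double[] out = w.clone();
        for (double[] g : grads)
            for (int j = 0; j < w.length; j++)
                out[j] -= lr * g[j] / grads.size();
        return out;
    }
}
{code}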
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Due Date: 17/May/18 (was: 21/May/18) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch size, the > degree of parallelism, as well as the checkpointing strategy (e.g. rollback > recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Due Date: 16/May/18 (was: 17/May/18) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or > existing model with a given configuration. An initial function signature would be > _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, > agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are > interested in providing the model (which will be a struct-like data structure > consisting the weights, the biases and the hyperparameters), the training > features and labels, the validation features and labels, the batch update > function, the update strategy (e.g. sync, async, hogwild!, > stale-synchronous), the update frequency (e.g. epoch or mini-batch), the > gradient aggregation function, the number of epochs, the batch size, the > degree of parallelism, as well as the checkpointing strategy (e.g. rollback > recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2306) Implementation of a script with paramserv func
[ https://issues.apache.org/jira/browse/SYSTEMML-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2306: Due Date: 18/May/18 > Implementation of a script with paramserv func > -- > > Key: SYSTEMML-2306 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2306 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > This task aims to write a dml script using the paramserv function. We > can easily reuse the MNIST LeNet example and adapt it by creating a > struct-like model and passing the update function as well as the aggregation > function. In this case, the update function, which will be executed in the workers, > computes the gradients by running the forward and backward passes over a batch. > The aggregation function, which will be run in the > parameter server, updates the weights and biases by > aggregating the received gradients. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not. A diagram of the parameter server architecture is shown below. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not.
> Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > # In the case of the local multi-threaded parameter server, it is easy to > maintain a concurrent hashmap (which stores the parameters as values under > defined keys) inside the CP. The workers are launched in a multi-threaded > way to execute the gradient calculation function and push the gradients to > the hashmap. Another thread will be launched to pull the gradients from the > hashmap and call the aggregation function to update the parameters. > # In the case of the Spark distributed backend, we can launch a single remote > parameter server outside of the CP (as a worker) to provide the pull and push > services. For the moment, all the weights and biases are saved in this single > server. The exchange between server and workers will be implemented over > TCP. Hence, we can easily broadcast the IP address and the port number to > the workers, and the workers can then send the gradients and retrieve the new > parameters via TCP sockets. > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock consisting of > all workers' clocks in the server. Each time an iteration finishes, the > worker sends a request to the server, and the server sends back a response > indicating whether the worker should wait or not. > A diagram of the parameter server architecture is shown below.
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), which stores each parameter as a value under a defined key. For example, inserting the global parameters under a key named “worker-param-replica” allows the workers to retrieve the parameter replica. Hence, in the local multi-threaded backend, the workers can communicate directly with this hashmap in the same process. In the Spark distributed backend, the CP first needs to fork a thread to start a parameter server which maintains the hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via TCP sockets. Since SystemML has good cache management, we only need to maintain in the hashmap a matrix reference pointing to a file location instead of the real data instance. If time permits, in order to introduce the async and staleness update strategies, we will implement the synchronization by leveraging vector clocks. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > # In the case of the local multi-threaded parameter server, it is easy to > maintain a concurrent hashmap (which stores the parameters as values under > defined keys) inside the CP.
The workers are launched in a multi-threaded > way to execute the gradient calculation function and push the gradients to > the hashmap. Another thread will be launched to pull the gradients from the > hashmap and call the aggregation function to update the parameters. > # In the case of the Spark distributed backend, we can launch a single remote > parameter server outside of the CP (as a worker) to provide the pull and push > services. For the moment, all the weights and biases are saved in this single > server. The exchange between server and workers will be implemented over > TCP. Hence, we can easily broadcast the IP address and the port number to > the workers, and the workers can then send the gradients and retrieve the new > parameters via TCP sockets. > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock consisting of > all workers' clocks in the server. Each time an iteration finishes, the > worker sends a request to the server, and the server sends back a response > indicating whether the worker should wait or not.
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. It consists of > partitioning the data for the worker threads, launching the single-node parameter > server in the CP, shipping and calling the compiled statistical functions and > implementing the different update strategies. We will focus on > implementing the BSP execution strategy, i.e., the synchronous update strategy, > with both per-epoch and per-batch frequencies. Other update strategies (e.g. > asynchronous, stale-synchronous) and checkpointing strategies are > optional and will be added if time permits. The architecture for the synchronous > per-epoch update strategy is illustrated below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. The idea is to spawn a thread to launch the local parameter server, which is responsible for maintaining the parameter hashmap and executing the aggregation work. Then a number of workers will be forked according to the degree of parallelism. Each worker loads its data partition, performs the parameter updates per batch, pushes its gradients, and retrieves the new parameters from the server. The server retrieves the gradients of each worker using the related keys in a round-robin way, aggregates them, and pushes the new global parameters under the parameter-related keys. Finally, the paramserv function's main thread waits for the server's aggregator thread to join and obtains the final global parameters as the result. Hence, the pull/push primitives bring more flexibility and facilitate implementing other update strategies; a sketch of this loop follows this message. was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. It consists of partitioning the data for the > worker threads, launching the single-node parameter server in the CP, shipping > and calling the compiled statistical functions, and implementing the different > update strategies. We will focus on implementing the BSP execution strategy, > i.e., the synchronous update strategy, with both per-epoch and per-batch > frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) > and checkpointing strategies are optional and will be added if time permits. > The architecture for the synchronous per-epoch update strategy is illustrated > below. > The idea is to spawn a thread to launch the local parameter server, which is > responsible for maintaining the parameter hashmap and executing the > aggregation work.
Then a number of workers will be forked according to > the degree of parallelism. Each worker loads its data partition, performs the > parameter updates per batch, pushes its gradients, and retrieves the new > parameters from the server. The server retrieves the gradients of each worker > using the related keys in a round-robin way, aggregates them, and pushes the > new global parameters under the parameter-related keys. Finally, the > paramserv function's main thread waits for the server's aggregator thread > to join and obtains the final global parameters as the result. Hence, the > pull/push primitives bring more flexibility and facilitate implementing > other update strategies. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
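A minimal sketch of this per-epoch BSP loop, under the assumption of one single-slot keyed mailbox per worker (plain JDK concurrency, illustrative names; not the actual SystemML runtime):
{code:java}
import java.util.*;
import java.util.concurrent.*;

/** Sketch of the local BSP loop: k worker threads push per-epoch gradients
 *  into keyed mailboxes; an aggregator thread collects them round-robin,
 *  updates the global parameters, and publishes them back for the next epoch. */
public class LocalBspBackend {
    public static void main(String[] args) throws Exception {
        int k = 4, epochs = 3;
        List<SynchronousQueue<double[]>> grads = new ArrayList<>();
        List<SynchronousQueue<double[]>> params = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            grads.add(new SynchronousQueue<>());   // e.g. key "gradients_i"
            params.add(new SynchronousQueue<>());  // e.g. key "params_i"
        }
        ExecutorService pool = Executors.newFixedThreadPool(k + 1);
        for (int w = 0; w < k; w++) {
            final int id = w;
            pool.submit(() -> {                        // worker thread
                for (int e = 0; e < epochs; e++) {
                    double[] g = {1.0 + id};           // stand-in gradient
                    grads.get(id).put(g);              // push gradients
                    params.get(id).take();             // pull new params (blocks)
                }
                return null;
            });
        }
        Future<double[]> server = pool.submit(() -> {  // aggregator thread
            double[] global = {0.0};
            for (int e = 0; e < epochs; e++) {
                double sum = 0;
                for (int w = 0; w < k; w++)            // round-robin retrieval
                    sum += grads.get(w).take()[0];
                global[0] -= 0.1 * sum / k;            // aggregate and update
                for (int w = 0; w < k; w++)            // publish new globals
                    params.get(w).put(global.clone());
            }
            return global;
        });
        // main thread joins the aggregator and obtains the final parameters
        System.out.println(Arrays.toString(server.get()));
        pool.shutdown();
    }
}
{code}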
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. Push/Pull service: In general, we can launch a parameter server inside (local multi-threaded backend) or outside (Spark distributed backend) of the CP to provide the pull and push services. For the moment, all the weights and biases are saved in a hashmap under a single key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately under a given key. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. The server will also spawn a thread which retrieves the gradients by polling the hashmap with the relevant keys and aggregates them. Finally, it updates the global parameters in the hashmap. Synchronization: We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain, in the server, a vector clock recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives the gradients from a certain worker, it increments the vector clock for this worker. Thus we can define BSP as "staleness==0", ASP as "staleness==-1", and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. # In the case of the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap (which stores the parameters as values under defined keys) inside the CP. The workers are launched in a multi-threaded way to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # In the case of the Spark distributed backend, we can launch a single remote parameter server outside of the CP (as a worker) to provide the pull and push services. For the moment, all the weights and biases are saved in this single server. The exchange between server and workers will be implemented over TCP. Hence, we can easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We also need to implement the synchronization between the workers and the parameter server in order to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clocks in the server. Each time an iteration finishes, the worker sends a request to the server, and the server sends back a response indicating whether the worker should wait or not.
A diagram of the parameter server architecture is shown below. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the > server receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below.
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: ps.png > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., the > stale-synchronous strategy needs a hyperparameter "staleness" to define the > waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the server > receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
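The staleness logic described above might be sketched as follows. The semantics are taken directly from the BSP/ASP/SSP definitions in the description ("staleness==0" blocks until all workers are level, "staleness==-1" never blocks, "staleness==N" allows a bounded lead); the class and method names are illustrative, not SystemML code.
{code:java}
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Server-side vector clock with the assumed staleness semantics:
 *  staleness==0 -> BSP, staleness==-1 -> ASP, staleness==N -> SSP. */
public class VectorClock {
    private final AtomicIntegerArray clocks; // one logical clock per worker

    VectorClock(int numWorkers) {
        clocks = new AtomicIntegerArray(numWorkers);
    }

    /** Called when the server receives gradients from a worker. */
    void onGradients(int workerId) {
        clocks.incrementAndGet(workerId);
    }

    /** Called when a worker asks whether it may start its next iteration:
     *  it must not run more than 'staleness' clocks ahead of the slowest. */
    boolean shouldWait(int workerId, int staleness) {
        if (staleness < 0) return false;             // ASP: never wait
        int slowest = Integer.MAX_VALUE;
        for (int i = 0; i < clocks.length(); i++)
            slowest = Math.min(slowest, clocks.get(i));
        return clocks.get(workerId) - slowest > staleness; // BSP if staleness==0
    }
}
{code}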
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: (was: ps.png) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., the > stale-synchronous strategy needs a hyperparameter "staleness" to define the > waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the server > receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP. (was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, with both per-epoch and per-batch frequencies. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. The idea is to spawn a thread to launch the local parameter server, which is responsible for maintaining the parameter hashmap and executing the aggregation work. Then a number of workers will be forked according to the degree of parallelism. Each worker loads its data partition, performs the parameter updates per batch, pushes its gradients, and retrieves the new parameters from the server. The server retrieves the gradients of each worker using the related keys in a round-robin way, aggregates them, and pushes the new global parameters under the parameter-related keys. Finally, the paramserv function's main thread waits for the server's aggregator thread to join and obtains the final global parameters as the result. Hence, the pull/push primitives bring more flexibility and facilitate implementing other update strategies.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. The idea is to spawn a thread in the CP for > running the parameter server. The workers are also launched in > a multi-threaded way in the CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP. (was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement a local execution backend for the compiled > “paramserv” function. The idea is to spawn a thread in the CP for running the > parameter server. The workers are also launched in a multi-threaded way in > the CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the BSP strategy for the Spark distributed backend. Hence, the idea is to be able to launch a remote parameter server and the workers. > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the Spark distributed backend. > Hence, the idea is to be able to launch a remote parameter server and the workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
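A bare-bones sketch of the worker/server TCP round trip described for this backend, using plain JDK sockets with Java serialization as a stand-in wire format; message framing, error handling, and the Spark-side launch of the server process are out of scope, and all names are illustrative rather than SystemML code.
{code:java}
import java.io.*;
import java.net.*;

/** Sketch of one push/pull round trip between a remote worker and the
 *  parameter server (illustrative only; not the real wire protocol). */
public class PsTcpExchange {

    /** Server side: accept one worker, read its gradients, apply a toy
     *  update step, and reply with the new global parameters. */
    static double[] serveOnce(ServerSocket server, double[] global)
            throws IOException, ClassNotFoundException {
        try (Socket s = server.accept()) {
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            out.flush(); // send the stream header so the peer can open its input
            ObjectInputStream in = new ObjectInputStream(s.getInputStream());
            double[] grads = (double[]) in.readObject();   // worker pushes
            for (int j = 0; j < global.length; j++)
                global[j] -= 0.1 * grads[j];               // toy aggregation
            out.writeObject(global);                       // worker pulls
            out.flush();
            return global;
        }
    }

    /** Worker side: host and port would have been broadcast beforehand. */
    static double[] pushPull(String host, int port, double[] grads)
            throws IOException, ClassNotFoundException {
        try (Socket s = new Socket(host, port)) {
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            out.flush();
            ObjectInputStream in = new ObjectInputStream(s.getInputStream());
            out.writeObject(grads);                        // push gradients
            out.flush();
            return (double[]) in.readObject();             // pull new params
        }
    }
}
{code}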
[jira] [Updated] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2302: Description: This part aims to complement the update strategies by adding ASP and SSP. > Second version of execution backend > --- > > Key: SYSTEMML-2302 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > This part aims to complement the update strategies by adding ASP and SSP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to implement the BSP strategy for the local execution backend. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP. (was: This part aims to implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in the CP for running the parameter server. The workers are also launched in a multi-threaded way in the CP.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the local execution backend. > The idea is to spawn a thread in the CP for running the parameter server. The > workers are also launched in a multi-threaded way in the CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Due Date: 25/May/18 (was: 28/May/18) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to add language support for the “paramserv” > function so that the new function can be compiled. Since SystemML > already supports parameterized built-in functions, we can easily add an > additional operation type and generate a new instruction for the “paramserv” > function. Recently, we have also added a new “eval” built-in function which > is capable of passing a function pointer as an argument so that it can be called at > runtime. Similarly, we need to extend the inter-procedural analysis > to avoid removing seemingly unused functions in the presence of the > second-order “paramserv” function, because the referenced functions, i.e., > the aggregation function and the update function, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Due Date: 1/Jun/18 (was: 4/Jun/18) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. > A multi-node, model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we can launch a parameter server inside (local multi-threaded > backend) or outside (Spark distributed backend) of the CP to provide the pull and > push services. For the moment, all the weights and biases are saved in a > hashmap under a single key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we can easily > broadcast the IP address and the port number to the workers, and the > workers can then send the gradients and retrieve the new parameters via TCP > sockets. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameters in the hashmap. > Synchronization: > We also need to implement the synchronization between the workers and the > parameter server in order to support more parameter update strategies; e.g., the > stale-synchronous strategy needs a hyperparameter "staleness" to define the > waiting interval. The idea is to maintain, in the server, a vector clock > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request to calculate the staleness according to the vector clock. When the server > receives the gradients from a certain worker, it increments the vector > clock for this worker. Thus we can define BSP as "staleness==0", ASP as > "staleness==-1", and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Due Date: 22/Jun/18 (was: 25/Jun/18) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the BSP strategy for the local execution backend. The idea is to spawn a thread in CP for running the parameter server, and the workers are also launched in a multi-threaded way in CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
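A rough sketch of this BSP control flow, reusing the hypothetical ParamServerSketch from the sketch above; the worker threads, barrier, and placeholder gradient computation are all illustrative assumptions, not the actual backend:
{code:java}
import java.util.concurrent.CyclicBarrier;

// Sketch of BSP in the local backend: k worker threads plus a server-side
// barrier action in CP, synchronized once per superstep. Illustrative only.
public class LocalBspSketch {
  public static void main(String[] args) {
    final int k = 4, supersteps = 10;
    final ParamServerSketch ps = new ParamServerSketch(new double[128]);
    // the barrier action runs once per superstep, after all workers pushed
    final CyclicBarrier barrier = new CyclicBarrier(k, () -> ps.aggregate(k));

    for (int w = 0; w < k; w++) {
      final int workerId = w;
      new Thread(() -> {
        try {
          for (int s = 0; s < supersteps; s++) {
            double[] params = ps.pull();               // fetch current model
            double[] grads = computeGradients(params); // local mini-batch step
            ps.push(workerId, grads);
            barrier.await(); // BSP: block until every worker finished the step
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }).start();
    }
  }

  static double[] computeGradients(double[] params) {
    return new double[params.length]; // placeholder gradient computation
  }
}
{code}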
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Due Date: 6/Jul/18 (was: 9/Jul/18) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the BSP strategy for the Spark distributed backend. The idea is to be able to launch a remote parameter server and the workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2302: Due Date: 27/Jul/18 (was: 6/Aug/18) > Second version of execution backend > --- > > Key: SYSTEMML-2302 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > This part aims to complement the update strategies by adding ASP and SSP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format.) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. (was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format.) > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2298. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > Fix For: SystemML 1.2 > > > During the bonding time, the development environment should be fully prepared. The native library OpenBLAS should be installed in order to run the MNIST LeNet example. Then, by leveraging the infimnist MNIST data generator ([http://leon.bottou.org/projects/infimnist]), we could generate 256k instances to train the model. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model [: a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam : a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE, EPOCH, EPOCH10): the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd=fun1, agg=fun2, mode=BSP, freq=EPOCH, epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, as well as the checkpointing strategy (e.g., rollback recovery). The function will return a trained model in struct format. > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. > *Inputs*: > * model [: a list consisting of the weight and bias matrices > * X : training features matrix > * y : training label matrix > * X_val : validation features matrix > * y_val : validation label matrix > * upd : the name of the gradient calculation function > * agg : the name of the gradient aggregation function > * mode (options: BSP, ASP, SSP): the update mode > * freq (options
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model [: a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam : a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE, EPOCH, EPOCH10): the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} > We are interested in providing the model (which will be a struct-like data
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs Output: * model' : a list consisting of the updated weight and bias matrices was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=par
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs *Output*: * model' : a list consisting of the updated weight and bias matrices was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs Output: * model' : a list consisting of the updated weight and bias matrices > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPO
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous", hyperparam=params, checkpoint="NONE"){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs *Output*: * model' : a list consisting of the updated weight and bias matrices was: The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", freq="EPOCH", epochs=100, batchsize=64, k=7, scheme=disjoint_contiguous, hyperparam=params, checkpoint=NONE){code} We are interested in providing the model (which will be a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partition scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function will return a trained model in struct format. *Inputs*: * model : a list consisting of the weight and bias matrices * X : training features matrix * y : training label matrix * X_val : validation features matrix * y_val : validation label matrix * upd : the name of the gradient calculation function * agg : the name of the gradient aggregation function * mode (options: BSP, ASP, SSP): the update mode * freq (options: EPOCH, BATCH): the frequency of updates * epochs : the number of epochs * batchsize : the batch size * k : the degree of parallelism * scheme (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partition scheme, i.e., how the data is distributed across workers * hyperparam [optional]: a list of additional hyperparameters, e.g., learning rate, momentum * checkpoint (options: NONE (default), EPOCH, EPOCH10) [optional]: the checkpointing strategy; we could set a checkpoint after each epoch or after every 10 epochs *Output*: * model' : a list consisting of the updated weight and bias matrices > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be: > {code:java} model'=paramserv(model, X, y, X_val, y_val, upd="fun1", agg="fun2", mode="BSP", fre
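To clarify the scheme options listed above, the following sketch shows how row indices could be assigned to k workers under disjoint_contiguous and disjoint_round_robin; the helper names are hypothetical, not part of the proposed API (disjoint_random would shuffle the row indices before a contiguous split, and overlap_reshuffle would give each worker a reshuffled copy of the full data):
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of two of the data partition schemes; method names
// are assumptions, not part of the actual paramserv implementation.
public class PartitionSchemes {
  // disjoint_contiguous: worker w gets one contiguous block of rows
  static List<List<Integer>> disjointContiguous(int numRows, int k) {
    List<List<Integer>> parts = new ArrayList<>();
    int blockSize = (int) Math.ceil((double) numRows / k);
    for (int w = 0; w < k; w++) {
      List<Integer> rows = new ArrayList<>();
      for (int i = w * blockSize; i < Math.min((w + 1) * blockSize, numRows); i++)
        rows.add(i);
      parts.add(rows);
    }
    return parts;
  }

  // disjoint_round_robin: row i goes to worker i % k
  static List<List<Integer>> disjointRoundRobin(int numRows, int k) {
    List<List<Integer>> parts = new ArrayList<>();
    for (int w = 0; w < k; w++) parts.add(new ArrayList<>());
    for (int i = 0; i < numRows; i++) parts.get(i % k).add(i);
    return parts;
  }
}
{code}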
[jira] [Created] (SYSTEMML-2317) Implementation of language extension
LI Guobao created SYSTEMML-2317: --- Summary: Implementation of language extension Key: SYSTEMML-2317 URL: https://issues.apache.org/jira/browse/SYSTEMML-2317 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to extend the parsing and validation at the language level. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2318) Hops, lops, instruction generation
LI Guobao created SYSTEMML-2318: --- Summary: Hops, lops, instruction generation Key: SYSTEMML-2318 URL: https://issues.apache.org/jira/browse/SYSTEMML-2318 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to implement the extension of hops, lops, and instruction generation for the new paramserv function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2319) IPA integration
LI Guobao created SYSTEMML-2319: --- Summary: IPA integration Key: SYSTEMML-2319 URL: https://issues.apache.org/jira/browse/SYSTEMML-2319 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to extend the IPA to avoid removing the referenced functions, given that the paramserv function is a second-order function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2320) Parfor integration
LI Guobao created SYSTEMML-2320: --- Summary: Parfor integration Key: SYSTEMML-2320 URL: https://issues.apache.org/jira/browse/SYSTEMML-2320 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to guarantee robustness for the case where the paramserv function is used inside a parfor statement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Issue Type: Technical task (was: Sub-task) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Technical task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to add language support for the “paramserv” function in order to be able to compile this new function. Since SystemML already supports parameterized built-in functions, we can easily add an additional operation type and generate a new instruction for the “paramserv” function. Recently, we have also added a new “eval” built-in function which is capable of passing a function pointer as an argument so that it can be called at runtime. Similarly, we would need to extend the inter-procedural analysis to avoid removing unused constructed functions in the presence of the second-order “paramserv” function, because the referenced functions, i.e., the aggregate and update functions, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Issue Type: Technical task (was: Sub-task) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Technical task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > Attachments: ps.png > > A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. > Push/Pull service: > In general, we could launch a parameter server inside (local multi-threaded backend) or outside (Spark distributed backend) of CP to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented over TCP. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. > Synchronization: > We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. (was: This part aims to implement the BSP strategy for the Spark distributed backend. The idea is to be able to launch a remote parameter server and the workers.) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via Netty RPC. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. (was: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap.) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented via Netty RPC. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via Netty RPC. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
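For illustration, a simplified worker-side exchange might look as follows; plain TCP sockets and Java serialization stand in for the proposed Netty RPC transport, a matching server-side handler is assumed, and all names are illustrative:
{code:java}
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.Socket;

// Worker-side sketch: push gradients, pull updated parameters. Plain TCP
// and Java serialization stand in for the proposed Netty RPC transport.
public class WorkerClientSketch {
  public static double[] pushAndPull(String serverHost, int serverPort,
                                     int workerId, double[] gradients) throws Exception {
    try (Socket socket = new Socket(serverHost, serverPort);
         ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
         ObjectInputStream in = new ObjectInputStream(socket.getInputStream())) {
      out.writeInt(workerId);     // identify the worker
      out.writeObject(gradients); // push: send the local gradients
      out.flush();
      return (double[]) in.readObject(); // pull: receive new global parameters
    }
  }
}
{code}
The broadcast of the server's IP address and port mentioned above is what makes this call possible on each worker.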
[jira] [Created] (SYSTEMML-2321) Aggregation service
LI Guobao created SYSTEMML-2321: --- Summary: Aggregation service Key: SYSTEMML-2321 URL: https://issues.apache.org/jira/browse/SYSTEMML-2321 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao The aggregation service is independent of local and remote workers. It is responsible for executing the parameter updates. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2322) Local workers
LI Guobao created SYSTEMML-2322: --- Summary: Local workers Key: SYSTEMML-2322 URL: https://issues.apache.org/jira/browse/SYSTEMML-2322 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to implement the local workers. It also covers data management, such as data distribution and program separation via function replication. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2323) Checkpointing
LI Guobao created SYSTEMML-2323: --- Summary: Checkpointing Key: SYSTEMML-2323 URL: https://issues.apache.org/jira/browse/SYSTEMML-2323 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to add the auxilary checkpointing service. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2323) Checkpointing
[ https://issues.apache.org/jira/browse/SYSTEMML-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2323: Description: It aims to add the auxiliary checkpointing service. (was: It aims to add the auxilary checkpointing service.) > Checkpointing > - > > Key: SYSTEMML-2323 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2323 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to add the auxiliary checkpointing service. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. Synchronization: We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. was: A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. Push/Pull service: In general, we could launch a parameter server inside (local multi-threaded backend) or outside (Spark distributed backend) of CP to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap separately with a given key, and the exchange between server and workers will be implemented over TCP. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via a TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap. Synchronization: We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Technical task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > Attachments: ps.png > > A single-node parameter server acts as a data-parallel parameter server. A multi-node model-parallel parameter server will be discussed if time permits. > Synchronization: > We also need to implement the synchronization between workers and the parameter server to be able to support more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock in the server recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request to calculate the staleness according to the vector clock. When the server receives gradients from a certain worker, it increments the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085:
Description:
A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
A diagram of the parameter server architecture is shown below.

was:
A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
Synchronization: We also need to implement the synchronization between the workers and the parameter server in order to support more parameter-update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain, on the server, a vector clock recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request asking the server to compute its staleness from the vector clock. When the server receives gradients from a given worker, it increments that worker's entry in the vector clock. We can then define BSP as "staleness == 0", ASP as "staleness == -1", and SSP as "staleness == N".
A diagram of the parameter server architecture is shown below.

> Single-node parameter server primitives
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Technical task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>
> A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
> A diagram of the parameter server architecture is shown below.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2324) Synchronization
LI Guobao created SYSTEMML-2324:
---
Summary: Synchronization
Key: SYSTEMML-2324
URL: https://issues.apache.org/jira/browse/SYSTEMML-2324
Project: SystemML
Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao

We also need to implement the synchronization between the workers and the parameter server in order to support more parameter-update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain, on the server, a vector clock recording all workers' clocks. Each time an iteration inside a worker finishes, the worker waits for a signal from the server, i.e., it sends a request asking the server to compute its staleness from the vector clock. When the server receives gradients from a given worker, it increments that worker's entry in the vector clock. We can then define BSP as "staleness == 0", ASP as "staleness == -1", and SSP as "staleness == N".
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
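To make the staleness semantics above concrete, here is a minimal sketch of a server-side vector clock; the class and method names ({{VectorClock}}, {{onGradientsReceived}}, {{awaitPermission}}) are invented for illustration and do not correspond to actual SystemML code:
{code:java}
// Hypothetical server-side vector clock implementing the staleness check;
// staleness == 0 gives BSP, -1 gives ASP, and N gives SSP.
public class VectorClock {
    private final int[] clocks;   // one logical clock per worker
    private final int staleness;  // 0 = BSP, -1 = ASP, N = SSP

    public VectorClock(int numWorkers, int staleness) {
        this.clocks = new int[numWorkers];
        this.staleness = staleness;
    }

    // Called when the server receives gradients from a worker:
    // increment that worker's entry and wake up any waiting workers.
    public synchronized void onGradientsReceived(int workerId) {
        clocks[workerId]++;
        notifyAll();
    }

    // Called by a worker after each iteration: block until this worker
    // is at most 'staleness' iterations ahead of the slowest worker.
    public synchronized void awaitPermission(int workerId) throws InterruptedException {
        if (staleness < 0)
            return; // ASP: fully asynchronous, never wait
        while (clocks[workerId] - minClock() > staleness)
            wait(); // BSP/SSP: too far ahead of the slowest worker
    }

    private int minClock() {
        int m = clocks[0];
        for (int c : clocks)
            m = Math.min(m, c);
        return m;
    }
}
{code}
Under this sketch, "staleness == 0" forces all workers to advance in lockstep (BSP), "staleness == -1" makes {{awaitPermission}} return immediately (ASP), and "staleness == N" lets a fast worker run at most N iterations ahead of the slowest one (SSP).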
[jira] [Updated] (SYSTEMML-2085) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085:
Description: A single-node parameter server acts as a data-parallel parameter server. A diagram of the parameter server architecture is shown below. (was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. A diagram of the parameter server architecture is shown below.)
Summary: Initial version of local backend (was: Single-node parameter server primitives)

> Initial version of local backend
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Technical task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>
> A single-node parameter server acts as a data-parallel parameter server. A diagram of the parameter server architecture is shown below.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)