Yeah, I want us to look closely into this problem in the context of deep
learning algorithms. I think we should plan on having first-class support for
DL in our 1.0 release, including efficient training (distributed SGD, with GPU
support) and efficient distributed scoring. The nice thing is that once we
achieve this, most of our existing algorithms will benefit as well.
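[Editor's note: for readers unfamiliar with the training mode mentioned above, here is a minimal, illustrative sketch of data-parallel SGD in plain Python. Each "worker" computes a gradient on its data shard, the driver averages the gradients, and the shared parameter is updated. This is a toy on 1-D linear regression; the names and setup are hypothetical and not SystemML's API.]

```python
# Toy data-parallel SGD: workers compute shard gradients, driver averages.
import random

random.seed(0)

# Synthetic data: y = 3x plus a little noise.
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in [i / 10 for i in range(40)]]

def shard_gradient(w, shard):
    """Gradient of mean squared error, d/dw mean((w*x - y)^2), on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def loss(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]

w, lr = 0.0, 0.05
for step in range(200):
    # Each worker computes its local gradient; the driver averages and updates.
    grads = [shard_gradient(w, s) for s in shards]
    w -= lr * sum(grads) / num_workers

print(round(w, 2))  # converges close to the true slope of 3.0
```

In a real distributed setting the shard gradients would be computed on executors and aggregated on the driver (or via all-reduce); the averaging step is what makes the update equivalent to a full-batch gradient over all shards.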
--
Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry
Sent from my iPhone.
> On Feb 15, 2017, at 12:22 PM, Niketan Pansare wrote:
>
> Hi Matthias,
>
> I am OK with removing this flag, but would prefer that we keep the JIRA open
> until we are sure that caching is not a bottleneck. I have noticed that the
> gradients turn sparse as we execute more iterations. Also, the cache release
> time depends on the memory budget. Here are the statistics from running LeNet
> on MNIST using
> https://github.com/apache/incubator-systemml/tree/master/scripts/staging/SystemML-NN/examples
>
> With 20G driver memory, the statistics after running 10 epochs are as follows:
> Epoch: 10, Iter: 700, Train Loss: 0.20480149054528493, Train Accuracy:
> 0.984375, Val Loss: 0.026928755962383588, Val Accuracy: 0.9922
> Epoch: 10, Iter: 800, Train Loss: 0.20165772217976913, Train Accuracy: 1.0,
> Val Loss: 0.027878978005867083, Val Accuracy: 0.9922
> 17/02/14 16:06:58 INFO DMLScript: SystemML Statistics:
> Total elapsed time: 12687.863 sec.
> Total compilation time: 2.168 sec.
> Total execution time: 12685.694 sec.
> Number of compiled Spark inst: 147.
> Number of executed Spark inst: 4.
> Cache hits (Mem, WB, FS, HDFS): 1096424/0/0/2.
> Cache writes (WB, FS, HDFS): 603950/15/8.
> Cache times (ACQr/m, RLS, EXP): 3.704/0.336/61.831/1.242 sec.
> HOP DAGs recompiled (PRED, SB): 0/154885.
> HOP DAGs recompile time: 28.663 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.024 sec.
> Spark ctx create time (lazy): 1.009 sec.
> Spark trans counts (par,bc,col):0/0/2.
> Spark trans times (par,bc,col): 0.000/0.000/3.433 secs.
> Total JIT compile time: 44.711 sec.
> Total JVM GC count: 7459.
> Total JVM GC time: 166.26 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) train 12138.979 sec 1
> -- 2) conv2d_bias_add 10876.708 sec 17362
> -- 3) conv2d_backward_filter 421.303 sec 17200
> -- 4) sel+ 239.660 sec 25881
> -- 5) update 226.687 sec 68800
> -- 6) update_nesterov 223.775 sec 68800
> -- 7) maxpooling_backward 136.709 sec 17200
> -- 8) conv2d_backward_data 134.315 sec 8600
> -- 9) ba+* 118.897 sec 51762
> -- 10) relu_maxpooling 112.283 sec 17362
> -- 11) relu_backward 107.483 sec 34400
> -- 12) uack+ 89.258 sec 34400
> -- 13) r' 74.304 sec 43000
> -- 14) +* 57.193 sec 34400
> -- 15) * 16.493 sec 95178
> -- 16) rand 16.038 sec 8613
> -- 17) / 8.352 sec 86492
> -- 18) rangeReIndex 6.628 sec 17208
> -- 19) + 3.054 sec 96528
> -- 20) uark+ 2.219 sec 43241
> -- 21) sp_csvrblk 2.183 sec 2
> -- 22) rmvar 1.517 sec 1451571
> -- 23) write 1.250 sec 9
> -- 24) - 1.059 sec 86486
> -- 25) createvar 1.026 sec 587259
> -- 26) exp 0.663 sec 17281
> -- 27) *2 0.361 sec 2
> -- 28) uasqk+ 0.277 sec 320
> -- 29) log 0.200 sec 160
> -- 30) uarmax 0.191 sec 17281
>
> With 5G driver memory, the statistics after running 10 epochs are as follows:
> Epoch: 10, Iter: 700, Train Loss: 0.19313544015858036, Train Accuracy: 1.0,
> Val Loss: 0.025943927403263182, Val Accuracy: 0.993
> Epoch: 10, Iter: 800, Train Loss: 0.1883995965207449, Train Accuracy: 1.0,
> Val Loss: 0.0260796819319468, Val Accuracy: 0.9916
> 17/02/14 20:16:40 INFO DMLScript: SystemML Statistics:
> Total elapsed time: 13886.763 sec.
> Total compilation time: 2.148 sec.
> Total execution time: 13884.615 sec.
> Number of compiled Spark inst: 147.
> Number of executed Spark inst: 4.
> Cache hits (Mem, WB, FS, HDFS): 1096422/0/2/2.
> Cache writes (WB, FS, HDFS): 603868/2176/8.
> Cache times (ACQr/m, RLS, EXP): 3.883/0.343/271.757/1.312 sec.
> HOP DAGs recompiled (PRED, SB): 0/154885.
> HOP DAGs recompile time: 28.290 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.023 sec.
> Spark ctx create time (lazy): 0.981 sec.
> Spark trans counts (par,bc,col):0/0/2.
> Spark trans times (par,bc,col): 0.000/0.000/3.501 secs.
> Total JIT compile time: 45.131 sec.
> Total JVM GC count: 7605.
> Total JVM GC time: 157.716 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) train 13301.811 sec 1
> -- 2) conv2d_bias_add 11890.291 sec 17362
> -- 3) conv2d_backward_filter 416.645 sec 17200
> -- 4) ba+* 252.966 sec 51762
> -- 5) sel+ 237.334 sec 25881
> -- 6) update 228.261 sec 68800
> -- 7) update_nesterov 225.383 sec 68800
> -- 8) maxpooling_backward 134.260 sec 17200
> -- 9) +* 133.959 sec 34400
> -- 10) conv2d_backward_data 128.046 sec 8600
> -- 11) relu_maxpooling 106.499 sec 17362
> -- 12) relu_backward 104.062 sec 34400
> -- 13) uack+ 90.104 sec 34400
> -- 14) r' 70.932 sec 43000
> -- 15) * 16.203 sec 95178
> -- 16) rand 16.131