+1 for this proposal. Operator fusion is a very common technique to improve effective memory bandwidth and reduce latency.
My suggestions:

* Flexibility
Fusion, especially pointwise fusion, is backend and device independent, so it is better to make the solution flexible rather than limit it to the GPU backend only. Different backends/devices can provide their own fusion code, via a toolchain or optimized kernels, through the same path. In the short term, enabling only the GPU kernel is fine, and I will contribute the CPU code soon.

* Reuse the MXNET_SUBGRAPH_BACKEND env variable
There are already lots of environment variables, so I suggest reusing the subgraph variable, which is already well known and documented. For example, MXNET_SUBGRAPH_BACKEND=MKLDNN currently selects the CPU fusion path.
https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md

Questions:

* "Introduce graph passes that look for subgraphs made of compatible pointwise ops and replace them with proper _FusedOp nodes."
What are the "compatible pointwise ops", and are they tied to the CUDA version or hardware independent? Do developers need to be aware of whether their new ops are compatible?

* "Fusion is guarded by MXNET_USE_FUSION environment variable. It should be decided what the default should be."
Any hints for the user? Is it possible for the user to switch off some of the fusions, or add more?

Thanks,

BR,
--Patric

> -----Original Message-----
> From: Przemysław Trędak [mailto:ptre...@apache.org]
> Sent: Sunday, June 9, 2019 11:57 AM
> To: d...@mxnet.apache.org
> Subject: Proposal - GPU pointwise fusion
>
> Hello Community,
>
> DL models, besides compute intensive operations like convolutions and fully
> connected layers, feature a lot of simple pointwise (aka elementwise)
> operations (like elementwise addition etc.). Performance of those operations
> is fully memory bandwidth bound and so it limits speedups from newer GPU
> hardware, which typically has high compute/memory bandwidth ratio. There
> are multiple attempts (e.g. TVM) ongoing to use compiler technology in order
> to deal with this and other, harder performance problems. However,
> integration of e.g. TVM into MXNet is a long term effort and there is a need
> for a simpler, more focused, approach to deal with this problem in the
> meantime.
>
> This proposal (design doc [1], PR [2]) attempts to be a short term solution to
> this problem - using existing NNVM backend to MXNet and without a big
> refactoring required.
>
> Any feedback and help will be greatly appreciated.
>
> Thank you,
> Przemek
>
> [1]
> https://cwiki.apache.org/confluence/display/MXNET/GPU+Pointwise+fusion
> [2] https://github.com/apache/incubator-mxnet/pull/15167
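P.S. To make the env-variable suggestion above concrete, here is a minimal sketch of how a user might select a fusion path per device. MXNET_SUBGRAPH_BACKEND=MKLDNN is the documented CPU subgraph path; MXNET_USE_FUSION is the guard named in the proposal (its default is still under discussion), and train.py is only a placeholder script, not part of MXNet.

```shell
# Sketch: choosing fusion paths via environment variables.
export MXNET_SUBGRAPH_BACKEND=MKLDNN  # CPU fusion via the MKL-DNN subgraph backend (documented)
export MXNET_USE_FUSION=1             # enable the proposed GPU pointwise fusion pass (from the design doc)
# python train.py                     # placeholder: run training with both paths enabled
echo "$MXNET_SUBGRAPH_BACKEND $MXNET_USE_FUSION"
```

With both variables set, CPU parts of the model would go through the MKL-DNN subgraph path while GPU parts use the new pointwise fusion, without introducing a new configuration mechanism.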