Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Da Zheng
Hello Marek,

Thank you for your detailed design doc. My understanding is that the
current implementation converts an NNVM graph to an ONNX graph and loads
the ONNX graph into TensorRT.
What is unclear to me is how an operator unsupported by TensorRT is
handled in this strategy. It seems you fall back to the MXNet operators.
Does your current solution partition the graph and load subgraphs into
TensorRT? If so, why do you need to convert a partitioned subgraph to
ONNX first? And if you convert the entire NNVM graph to ONNX, could you
describe in more detail how you fall back to MXNet operators?

Thanks,
Da



Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Hagay Lupesko
+1 for reviewing a design doc.

Naveen - why do you see it sitting under ONNX? Isn't it a broader topic of GPU
acceleration?

Hagay


Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Naveen Swamy
Please add your proposal under design proposals. Once the community has
reviewed it and there is consensus on the approach, we can create an
ONNX-MXNet subsection and move it there.


Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Naveen Swamy
You have access now.


Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Naveen Swamy
I'll add in about an hour


Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marco de Abreu
I don't know how to grant permission on Confluence. If somebody else knows
how to do so, please grant Marek the edit permissions.

-Marco


Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marek Kolodziej
Hi Rajan,

I wanted to share on Confluence, but it didn't allow me to create a new
document. If my e-mail address gets permissions to add new Confluence
pages, I'll transfer the contents to Confluence. Please keep me posted when
I get edit permissions.

Thanks!

Marek




Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread singh . rajan28
Hi Marek,

Thanks for sharing the document. It would be great if you could share it on
the Confluence wiki or in a Quip document. The formatting here makes it very
difficult to read a long document.

Appreciate the help.

Thanks
Rajan  


Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marek Kolodziej
Hi Marco,

Sorry for the formatting being lost.

Here's the original Google doc. I actually wanted to use Confluence
originally, but I didn't have permissions to edit, so here goes.

https://docs.google.com/document/d/1UbsUacxWRKXCEE6v0r4VmKL76QLmFQYgMyAcQP0I8U0/edit?usp=sharing

Best,

Marek




Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marco de Abreu
Hello Marek,

This sounds great! Definitely looking forward to it.

It seems like our mailing list destroyed your formatting. You might want to
consider putting it into a Google Docs document or uploading it to
Confluence.

Best regards,
Marco


Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marek Kolodziej
Hi everyone,

This is a quick summary of NVIDIA’s plans for open-sourcing an initial
integration of TensorRT as a runtime accelerator of MxNet (PR for discussion
coming in the next few days, ETA of the first draft of the PR is this Friday
or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA
GPUs compared to running the full graph in MxNet using unfused GPU operators.
In addition to faster fp32 inference, TensorRT optimizes fp16 inference, and
is capable of int8 inference (provided the quantization steps are performed).
Besides increasing throughput, TensorRT significantly reduces inference
latency, especially for small batches. See more here.

2. Despite its benefits, using pre-trained models with TensorRT typically
requires some effort - either re-writing the model using TensorRT’s graph
building APIs, or exporting a model to ONNX, followed by an import step. Even
if the import is simplified using ONNX, the TensorRT user still needs to
provide their own data pipeline, which used to exist in the framework, but no
longer does in a stand-alone TensorRT deployment with a client application.
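
To make point 2 concrete, here is a rough sketch of the export-then-import
path, assuming the mxnet.contrib.onnx converter (this proposal does not
prescribe that API) and a hypothetical "resnet-18" checkpoint. These are the
extra steps the runtime integration is meant to make unnecessary:

```python
import numpy as np
import mxnet as mx
from mxnet.contrib import onnx as onnx_mxnet  # assumes the mx2onnx converter is available

# Hypothetical checkpoint files ("resnet-18-symbol.json" / "resnet-18-0000.params").
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet-18", 0)
params = dict(arg_params, **aux_params)

# Export to ONNX; the TensorRT user must then import this file into TensorRT
# and rebuild the data pipeline on their own.
onnx_path = onnx_mxnet.export_model(sym, params, [(1, 3, 224, 224)],
                                    np.float32, "resnet-18.onnx")
```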
3. TensorRT is very performant, but does not have the full set of MxNet’s
operators. While that could be addressed with TensorRT plugins, it’s much
simpler to reuse the already-existing MxNet operators. Also, the user
shouldn’t need to know which operators are supported by TensorRT and which
ones aren’t - runtime integration allows the graph partitioner to extract
subgraphs capable of running inside TensorRT, place each subgraph in a
TensorRT operator in MxNet, execute that operator as part of MxNet’s graph
execution, and handle non-TensorRT-compatible nodes as the regular MxNet
operators remaining after the TensorRT subgraph extraction and node
substitution. The goal is to accelerate inference without changing the user
experience.
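
To illustrate the partitioning idea in point 3, here is a deliberately
simplified sketch (not the PR’s code - it ignores real graph topology and
just walks a topologically sorted node list, and the supported-op set is a
made-up placeholder) of how nodes might be grouped into TensorRT-capable
subgraphs, with everything else left to run as ordinary MxNet operators:

```python
# Simplified illustration only - the real partitioner works on the NNVM graph.
TRT_SUPPORTED = {"Convolution", "Activation", "Pooling", "FullyConnected"}

def partition(nodes):
    """nodes: list of (node_name, op_type) in topological order."""
    segments, current = [], []
    for name, op in nodes:
        if op in TRT_SUPPORTED:
            current.append(name)                           # grow the TensorRT subgraph
        else:
            if current:
                segments.append(("tensorrt_op", current))  # becomes one TensorRT node
                current = []
            segments.append(("mxnet_op", [name]))          # falls back to MxNet
    if current:
        segments.append(("tensorrt_op", current))
    return segments

print(partition([("conv0", "Convolution"), ("relu0", "Activation"),
                 ("custom0", "MyCustomOp"), ("fc0", "FullyConnected")]))
# [('tensorrt_op', ['conv0', 'relu0']), ('mxnet_op', ['custom0']),
#  ('tensorrt_op', ['fc0'])]
```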
possible optimizations once the tensor shapes are known, it is imperative
that all the shape information be provided. This means that the best time
to construct the TensorRT graph is bind time. The coming PR can selectively
apply the TensorRT optimization for inference-only graphs at symbol bind
time. This is in fact consistent with the assumptions about TensorRT made
on the MxNet Wiki here
.
2. Since as mentioned in #1, TensorRT graph building needs shape
information only available at bind time, an important goal was not to
disrupt any existing APIs. Even though C++ permits default function
arguments, the Python bindings for symbol-related methods (e.g. simple
bind) are exposed via a C, not C++, API, wired on the Python side using
Ctypes (e.g. see here

for the simple bind integration). This precludes the addition of extra
arguments without causing breaking changes in the C API. Also, adapting the
Python code to such changes wouldn’t be enough, since all frontend
languages use the C (not C++) API for the FFI. Fortunately, C API changes
could be avoided, by simply letting the user enable or disable the TensorRT
pass using an environment variable (USE_TENSORRT=1 to enable). This also
does not diminish the flexibility of the integration, since the graph pass
can read the environment variable each time symbol binding is done, and
hence permits turning the graph passes on and off, depending on need. The
ability to enable and disable the TensorRT pass at runtime also makes unit
testing easier.3. TensorRT requires that the workspace size is provided at
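
As a usage illustration of points 1 and 2, a minimal sketch of what an
inference-only bind could look like with the pass enabled. It assumes the
USE_TENSORRT variable name from this proposal (the exact name and behavior
may change in the PR), and weights are left uninitialized since only the
bind-time flow matters here:

```python
import os
import mxnet as mx

os.environ["USE_TENSORRT"] = "1"   # enable the TensorRT graph pass (per this proposal)

data = mx.sym.Variable("data")
net  = mx.sym.Convolution(data, num_filter=16, kernel=(3, 3), name="conv0")
net  = mx.sym.Activation(net, act_type="relu", name="relu0")
net  = mx.sym.FullyConnected(mx.sym.flatten(net), num_hidden=10, name="fc0")

# All shapes are known at bind time, which is when the graph would be
# partitioned and the TensorRT engine(s) built. grad_req="null" marks this
# as an inference-only bind; weights are left uninitialized in this sketch.
executor = net.simple_bind(ctx=mx.gpu(0), grad_req="null", data=(8, 3, 224, 224))
outputs  = executor.forward(is_train=False,
                            data=mx.nd.ones((8, 3, 224, 224), ctx=mx.gpu(0)))
print(outputs[0].shape)
```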
3. TensorRT requires that the workspace size is provided at graph
construction time. This value constitutes the upper limit on the amount of
memory that TensorRT can use, and does not determine immediate use. Since
this amount can be hard for the user to know, its limit should be set to a
reasonable value that the user need not concern themselves with. Given that
TensorRT integration is applied at bind time and that TensorRT engines
wrapped in TensorRT nodes are constructed during the graph pass rather than
the memory allocation pass, MxNet will only allocate the amount needed for
the nodes remaining after the TensorRT subgraphs have been extracted. This
means that no memory will be doubly allocated - first for the complete MxNet
subgraph and then for TensorRT. However, the question remains whether the
memory used per TensorRT engine should be a configurable parameter, either as
a method argument or an environment variable, or whether TensorRT should be
able to use the maximum available GPU memory and then reserve only what it
needs. I would like to suggest the latter. Since the TensorRT subgraph will
typically use less memory than the same subgraph in MxNet (due to more layer
fusion), it’s
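
For reference, this is roughly what the workspace cap from point 3 looks like
in the standalone TensorRT Python API, shown purely for illustration and
assuming a pre-TensorRT-8 style API (the MxNet integration would set an
equivalent limit internally in C++, and the 1 GiB value is an arbitrary
example):

```python
import tensorrt as trt  # standalone TensorRT API, shown only for illustration

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
# ... populate `network` from the extracted subgraph ...

# Upper bound on scratch memory TensorRT may use while building/running the
# engine - a cap, not an up-front allocation, which is why a generous default
# the user never has to think about is reasonable.
builder.max_workspace_size = 1 << 30  # 1 GiB (pre-TensorRT-8 style API)
engine = builder.build_cuda_engine(network)
```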