Hi

While looking at the failures reporting in this issue:
https://github.com/apache/incubator-mxnet/issues/14522

I have noticed that in mshadow when calling the BLAS Engine we are
doing narrowing integer conversions from index_t (int64_t) to int, and
then operations on dimensions that can overflow integer arithmetic
such as i * m *k as seen in the second link below. Which when added to
the pointer holding the matrix data results in a) undefined behaviour,
and b) in x86_64 a subtraction instead of an addition due to platform
dependent integer overflow semantics in x86 platforms.

I think we should address this in a twofold manner: checking the
shapes for possible overflows in the implementation (which will have
some performance impact), and second we should widen the types of
BLASEngine to index_t.

https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_cpu-inl.h#L613

Which in CPU ends up calling batched_gemm:

https://github.com/dmlc/mshadow/blob/master/mshadow/dot_engine-inl.h#L339

Let me know if you have additional thoughts in this solution or see
any blockers or better ideas, otherwise I will proceed to work on PRs
fixing this in mshadow. Since it's an issue that seem to happen often
I think we should be really careful with integer overflows and
undefined behaviour related bugs and pay attention in CRs to this kind
of traps.

Thanks!

Pedro.

Reply via email to