AI Engine-ML v2 Intrinsics User Guide
v2025.1
Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.
For integer datatypes, a matrix A of size MxN is multiplied by a matrix B of size NxP. The naming convention for these operations is [operation][_MxN_NxP]{_Cch}{_conf} or [operation]_conv_MxN{_Cch}{_conf}. Properties in [] are mandatory; properties in {} are optional. In this naming, conv indicates a convolution operation, conf indicates the use of sub, zero or shift masks, and C gives the number of channels. For example, a name following the pattern mul_4x8_8x16 denotes the multiplication of a 4x8 matrix A by an 8x16 matrix B.
For an MxN vector multiply convolution operation, the calculation performed for each output lane \(x = 0, \dots, \text{M}-1\) is:
\[ \text{mul\_conv\_MxN}(F,G)(x) = \sum_{u=0}^{\text{N}-1}{G(u)\, F(x+u)} \]
where the vector \(F\) has length \(\text{M}+\text{N}-1\), and the vector \(G\) has length \(\text{N}\).
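As a reference, the following is a minimal scalar sketch of this computation in plain C++; it illustrates the math only, not the vector intrinsic, and the function and parameter names are ours:

```cpp
#include <array>
#include <cstdint>

// Scalar reference for the mul_conv_MxN computation defined above:
//   out(x) = sum over u of G(u) * F(x+u), for x = 0..M-1.
// F holds the M+N-1 input samples, G the N kernel taps. With int16_t
// inputs and int64_t accumulation this mirrors the
// 16-bit x 16-bit = 64-bit convolution modes listed in the table.
template <int M, int N>
std::array<std::int64_t, M> mul_conv(const std::array<std::int16_t, M + N - 1>& F,
                                     const std::array<std::int16_t, N>& G) {
    std::array<std::int64_t, M> out{};
    for (int x = 0; x < M; ++x)      // one output lane per position x
        for (int u = 0; u < N; ++u)  // accumulate the N kernel taps
            out[x] += std::int64_t(G[u]) * F[x + u];
    return out;
}
```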
For element-wise operations, the naming is [operation_elem_C]{_N}. Here, C is the number of channels and N is the number of columns of matrix A/rows of matrix B; N is either two or it is omitted. The element-wise operations are executed channel by channel, and the output is again a matrix with C channels.
For complex datatypes, a multiplication of two matrices with complex elements is performed. The naming convention for these operations is [operation_elem_8/16]{_conf} for multiply-accumulate of 32b x 16b complex integer datatypes and [operation_elem_8/8_2/16/16_2]{_conf} for multiply-accumulate of 16b x 16b complex integer datatypes. Here, eight is the number of channels and two is the number of columns of matrix A/rows of matrix B. The matrix multiplication is performed individually for each channel of the input matrices, and the output is again a matrix with eight channels.
The following table shows the matrix multiplications that can be completed within a single cycle.
Precision Mode | Channels | Matrix A | Matrix B | Matrix C |
---|---|---|---|---|
8-bit x 8-bit = 32-bit | 1 | 8x8 | 8x8 | 8x8 |
8-bit x 8-bit = 32-bit | 1 | 8x16 | (16x8)T (sparse) | 8x8 |
8-bit x 8-bit = 32-bit | 1 | 4x16 | (16x8)T (sparse) | 4x8* |
8-bit x 8-bit = 32-bit | 1 | 4x8 | 8x16 | 4x16 |
8-bit x 8-bit = 32-bit | 1 | 4x16 | (16x16)T (sparse) | 4x16 |
8-bit x 8-bit = 32-bit | 64 | 1x2 | 2x1 | 1x1 |
8-bit x 8-bit = 32-bit | 64 | 1x1 | 1x1 | 1x1 |
8-bit x 8-bit = 32-bit | 8 | 8x8 (conv.) | 8x1 | 8x1 |
8-bit x 8-bit = 32-bit | 1 | 64x8 (conv.) | 8x1 | 64x1 |
16-bit x 16-bit = 32-bit | 1 | 8x2 | 2x8 | 8x8 |
16-bit x 16-bit = 32-bit | 64 | 1x1 | 1x1 | 1x1 |
16-bit x 16-bit = 32-bit | 32 | 1x1 | 1x1 | 1x1* |
16-bit x 16-bit = 64-bit | 1 | 4x4 | 4x8 | 4x8 |
16-bit x 16-bit = 64-bit | 1 | 4x8 | (8x8)T (sparse) | 4x8 |
16-bit x 16-bit = 64-bit | 32 | 1x2 | 2x1 | 1x1 |
16-bit x 16-bit = 64-bit | 32 | 1x1 | 1x1 | 1x1 |
16-bit x 16-bit = 64-bit | 1 | 32x4 (conv.) | 4x1 | 32x1 |
16-bit x 16-bit = 64-bit | 8 | 4x4 (conv.) | 4x1 | 4x1 |
Complex 16-bit x Complex 16-bit = 64-bit | 16 | 1x2 | 2x1 | 1x1 |
Complex 16-bit x Complex 16-bit = 64-bit | 16 | 1x1 | 1x1 | 1x1 |
32-bit x 16-bit = 64-bit | 1 | 4x2 | 2x8 | 4x8 |
Complex 32-bit x Complex 16-bit = 64-bit | 16 | 1x1 | 1x1 | 1x1 |
Complex 32-bit x Complex 16-bit = 64-bit | 8 | 1x1 | 1x1 | 1x1* |
fp8 x fp8 = fp32 | 1 | 8x8 | 8x8 | 8x8 |
fp8 x fp8 = fp32 | 1 | 4x16 | (16x16)T (sparse) | 4x16 |
fp16 x fp16 = fp32 | 64 | 1x1 | 1x1 | 1x1 |
fp16 x fp16 = fp32 | 32 | 1x1 | 1x1 | 1x1* |
fp16 x fp16 = fp32 | 1 | 4x8 | 8x8 | 4x8 |
fp16 x fp16 = fp32 | 1 | 4x16 | (16x8)T (sparse) | 4x8 |
bfloat16 x bfloat16 = fp32 | 64 | 1x1 | 1x1 | 1x1 |
bfloat16 x bfloat16 = fp32 | 32 | 1x1 | 1x1 | 1x1* |
bfloat16 x bfloat16 = fp32 | 1 | 4x8 | 8x8 | 4x8 |
bfloat16 x bfloat16 = fp32 | 1 | 4x16 | (16x8)T (sparse) | 4x8 |
MX9 x MX9 = fp32 | 1 | 4x8 4x8 | (8x16)T (8x16)T | 4x16 |
MX6 x MX6 = fp32 | 1 | 4x16 | (16x16)T | 4x16 |
The MUL and MAC operations can be summarized as acc_out = X_vec x Y_vec and acc_out = acc_in + X_vec x Y_vec respectively, with 'x' being the matrix multiplication operator. The MSC, NEGMUL and MACMUL operations, and the MAC/MSC variants with an additional acc_in2 input, follow the same pattern with the product term subtracted or negated, or with a further accumulator input added.
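A scalar sketch of these accumulator semantics (one lane shown; the function names are illustrative, and the MACMUL/acc_in2 variants are omitted):

```cpp
#include <cstdint>

// One accumulator lane; prod stands for the corresponding lane of the
// matrix product X_vec x Y_vec, acc_in for the incoming accumulator lane.
std::int64_t mul_lane(std::int64_t prod)                      { return  prod; }           // MUL
std::int64_t mac_lane(std::int64_t acc_in, std::int64_t prod) { return  acc_in + prod; }  // MAC
std::int64_t msc_lane(std::int64_t acc_in, std::int64_t prod) { return  acc_in - prod; }  // MSC
std::int64_t negmul_lane(std::int64_t prod)                   { return -prod; }           // NEGMUL
```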
The convolve variants of these intrinsics differ in that they apply a convolution product on the vectors instead of a matrix multiplication, with '*' being the vector convolution operator: acc_out = acc_in + X_vec * Y_vec. Here, X_vec is the data matrix and Y_vec the kernel.
Some variants allow the passing of masks that are used to determine the sign, zeroing and negation of vector or accumulator lanes. These masks are the sub, zero and shift masks indicated by the _conf suffix.
Complex multiplications require some terms to be negated in order to implement conjugation and multiplication by -j. This is done through the sub_mask. The following examples show how this mask is used when two complex numbers, X and Y, are multiplied to get an output O. For multiply-accumulate of 16b x 16b complex integer datatypes, two complex products are added after the multiplication; they are indicated by the postfix 0/1.
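As an illustration from standard complex arithmetic, the following expansions show which of the four product terms per complex multiplication change sign (in the 16b x 16b case they apply to each of the two post-added products; the exact mapping of terms to sub_mask bits is not shown here):

\[ X \cdot Y = (x_r y_r - x_i y_i) + j\,(x_r y_i + x_i y_r) \]

\[ \overline{X} \cdot Y = (x_r y_r + x_i y_i) + j\,(x_r y_i - x_i y_r) \]

\[ -j\,(X \cdot Y) = (x_r y_i + x_i y_r) - j\,(x_r y_r - x_i y_i) \]

Conjugation flips the sign of the two terms involving \(x_i\); multiplication by \(-j\) swaps the real and imaginary parts and negates the new imaginary part. The sub_mask encodes exactly such per-term sign choices.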
For multiply-accumulate of 32b x 16b complex integer datatypes there is no post-adding and only four unique terms are needed. However, all 8 mask bits must be specified appropriately: the bits that index the same term must be given the same value.
Some intrinsics are used for multiplications of matrices with a given number of channels. Each MxN matrix is stored in row-major and channel-minor fashion: the channel index varies fastest, then the column, then the row. The following example shows the resulting layout of elements in the vector for a 4x4 matrix with two channels. The indices for each element are given as (m,n,c).
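This layout follows directly from the stated storage rule; vector element \(i\) holds matrix element \((m,n,c)\) with \(i = m \cdot 4 \cdot 2 + n \cdot 2 + c\):

```
(0,0,0) (0,0,1) (0,1,0) (0,1,1) (0,2,0) (0,2,1) (0,3,0) (0,3,1)
(1,0,0) (1,0,1) (1,1,0) (1,1,1) (1,2,0) (1,2,1) (1,3,0) (1,3,1)
(2,0,0) (2,0,1) (2,1,0) (2,1,1) (2,2,0) (2,2,1) (2,3,0) (2,3,1)
(3,0,0) (3,0,1) (3,1,0) (3,1,1) (3,2,0) (3,2,1) (3,3,0) (3,3,1)
```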
The elem variants allow you to perform element-wise operations. The operations are performed along the channels: if you perform a (1x1x32) x (1x1x32) operation, a multiplication is done between the elements of the same channel, so the elements of channel zero are multiplied, the elements of channel one are multiplied, and so on. The end result again has 32 channels.
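A scalar sketch of this per-channel behaviour in plain C++ (illustrative names, not the intrinsic itself):

```cpp
#include <array>
#include <cstdint>

// Element-wise (1x1xC) x (1x1xC): each channel is multiplied independently,
// so the result keeps all C channels (C = 32 in the example above).
template <int C>
std::array<std::int64_t, C> elem_mul(const std::array<std::int16_t, C>& a,
                                     const std::array<std::int16_t, C>& b) {
    std::array<std::int64_t, C> out{};
    for (int c = 0; c < C; ++c)
        out[c] = std::int64_t(a[c]) * b[c];
    return out;
}
```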
Some of the elem variants perform matrix multiplications along the channels. For those cases the multiplication (1x2xC) x (2x1xC) is performed and the end result is a (1x1xC) matrix. Despite the name, this is not a true element-wise multiplication.
Convolutional operations work similarly to element-wise multiplication: in every step the kernel is multiplied with the matrix before it is shifted to the next position, and the same is done for each channel. The difference from a regular element-wise multiplication is that, after the multiplications for each channel have been completed, the resulting matrices are added together, so the final result has only one channel.
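A scalar sketch of this per-channel convolution with the final summation into a single channel (illustrative names; M = 4, N = 4, C = 8 reproduces the 8-channel 4x4 convolution mode in the table):

```cpp
#include <array>
#include <cstdint>

// C-channel convolution: the kernel G is applied per channel, and the
// per-channel results are summed, so the output has a single channel.
// Data is indexed as [channel][position].
template <int M, int N, int C>
std::array<std::int64_t, M> conv_channels(
        const std::array<std::array<std::int16_t, M + N - 1>, C>& F,
        const std::array<std::array<std::int16_t, N>, C>& G) {
    std::array<std::int64_t, M> out{};
    for (int c = 0; c < C; ++c)          // convolve each channel ...
        for (int x = 0; x < M; ++x)
            for (int u = 0; u < N; ++u)  // ... and sum into one channel
                out[x] += std::int64_t(G[c][u]) * F[c][x + u];
    return out;
}
```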
When multiplying with a scalar bfloat16, it is internally cast to float, which influences the rounding behaviour in combination with negation. Because the cast involves a rounding operation, it matters whether the negation is performed before or after the rounding: in the first case, the rounding is applied to the positive result before the negation; in the second and third cases, the negation happens before the rounding, which leads to a different result.
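As a purely illustrative sketch, assume a rounding step \(\operatorname{rnd}\) that is not symmetric around zero (e.g. round-half-up; the actual mode depends on the configured rounding behaviour). Then negating before or after rounding gives different results for values exactly halfway between two representable numbers:

\[ -\operatorname{rnd}(2.5) = -3 \qquad \text{but} \qquad \operatorname{rnd}(-2.5) = -2 \]

In general, \(-\operatorname{rnd}(x) \neq \operatorname{rnd}(-x)\) for such halfway values under an asymmetric mode.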
Element-wise multiplication and matrix multiplication intrinsics for the FP32 input type are emulated using the bfloat16 data path. There are three options to choose from (a sketch of the underlying technique follows the list):

- Default option (most accurate, but slow)
- Fast and accurate option
- Fastest option, with loss of accuracy
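The common idea behind all three options is to split each fp32 operand into bfloat16 parts and accumulate partial products on the bfloat16 data path; more split terms give higher accuracy at the cost of speed, while a single truncated product per operand is fastest but least accurate. The sketch below is illustrative (our own names, not the library's implementation) and shows a two-way split with three partial products:

```cpp
#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 precision (keep the top 16 bits).
static float to_bf16(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    float r;
    std::memcpy(&r, &u, sizeof r);
    return r;
}

// Emulated fp32 multiply on a bfloat16 data path: split each operand into a
// high and a low bfloat16 part and accumulate the partial products.
// The al*bl term is dropped here as it falls below fp32 precision.
float mul_fp32_via_bf16(float a, float b) {
    float ah = to_bf16(a), al = to_bf16(a - ah);
    float bh = to_bf16(b), bl = to_bf16(b - bh);
    return ah * bh + ah * bl + al * bh;
}
```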