AI Engine-ML Intrinsics User Guide (v2025.1)
Multiply Accumulate

Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.

Overview

Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.

For integer datatypes, a matrix A of size MxN is multiplied with a matrix B of size NxP. The naming convention for these operations is: [operation][_MxN_NxP]{_Cch}{_conf} or [operation]_conv_MxN{_Cch}{_conf}. Properties in [] are mandatory, properties in {} are optional. In this naming, conv indicates a convolutional operation, conf indicates the use of sub, zero or shift masks and C gives the number of channels.
For an MxN vector multiply convolution operation, the calculation performed is:

\[ \text{mul_conv_MxN}(F,G) = \sum_{u=0}^{\text{N}-1}{G(u) F(x+u)} \]

where the vector \(F\) has length \(\text{M}+\text{N}-1\), and the vector \(G\) has length \(\text{N}\).
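As a reference, this sum can be modelled in a few lines of Python (a hand-written sketch, not the intrinsic itself; the names mul_conv, F, G, and M are illustrative):

```python
# Reference model of the MxN vector multiply-convolution:
# out[x] = sum_{u=0}^{N-1} G[u] * F[x+u], for x = 0 .. M-1.
# F has length M+N-1 and G has length N, as in the formula above.
def mul_conv(F, G, M):
    N = len(G)
    assert len(F) == M + N - 1
    return [sum(G[u] * F[x + u] for u in range(N)) for x in range(M)]
```

For example, mul_conv([1, 2, 3, 4], [1, 1], 3) yields [3, 5, 7].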

For element-wise operations, the naming is [operation_elem_C]{_N}. Here, C is the number of channels and N is the number of columns of matrix A/rows of matrix B. N is either two or it is omitted. The element-wise operations are executed channel by channel. The output will also be a matrix with C channels.

For complex datatypes, a multiplication of two matrices with complex elements is performed. The naming convention for these operations is [operation_elem_8]{_conf} for Multiply-accumulate of 32b x 16b complex integer datatypes and [operation_elem_8_2]{_conf} for Multiply-accumulate of 16b x 16b complex integer datatypes. Here, eight is the number of channels and two is the number of columns of matrix A/rows of matrix B. The matrix multiplication is performed individually for each channel of the input matrices. The output will also be a matrix with eight channels.

The following table shows the matrix multiplications that can be completed within a single cycle.

Precision Mode                             Channels   Matrix A             Matrix B             Matrix C
8-bit x 4-bit = 32-bit                     1          4x16                 16x8                 4x8
8-bit x 4-bit = 32-bit                     1          4x32                 32x8 (sparse)        4x8
8-bit x 8-bit = 32-bit                     1          4x8                  8x8                  4x8
8-bit x 8-bit = 32-bit                     32         1x2                  2x1                  1x1
8-bit x 8-bit = 32-bit                     8          4x4 (convolution)    4x1                  4x1
8-bit x 8-bit = 32-bit                     4          8x8 (convolution)    8x1                  8x1
8-bit x 8-bit = 32-bit                     1          32x8 (convolution)   8x1                  32x1
8-bit x 8-bit = 32-bit                     1          4x16                 16x8 (sparse)        4x8
16-bit x 8-bit = 32-bit                    1          4x4                  4x8                  4x8
16-bit x 8-bit = 32-bit                    2          4x4                  4x4                  4x4
16-bit x 16-bit = 32-bit                   1          4x2                  2x8                  4x8
16-bit x 16-bit = 32-bit                   32         1x1                  1x1                  1x1
16-bit x 8-bit = 64-bit                    1          2x8                  8x8                  2x8
16-bit x 8-bit = 64-bit                    1          4x8                  8x4                  4x4
16-bit x 8-bit = 64-bit                    1          2x16                 16x8 (sparse)        2x8
16-bit x 16-bit = 64-bit                   1          2x4                  4x8                  2x8
16-bit x 16-bit = 64-bit                   1          4x4                  4x4                  4x4
16-bit x 16-bit = 64-bit                   16         1x2                  2x1                  1x1
16-bit x 16-bit = 64-bit                   1          16x4 (convolution)   4x1                  16x1
Complex 16-bit x Complex 16-bit = 64-bit   8          1x2                  2x1                  1x1
16-bit x 16-bit = 64-bit                   1          2x8                  8x8 (sparse)         2x8
32-bit x 16-bit = 64-bit                   1          4x2                  2x4                  4x4
Complex 32-bit x Complex 16-bit = 64-bit   8          1x1                  1x1                  1x1
bfloat16 x bfloat16 = fp32                 1          4x8                  8x4                  4x4
bfloat16 x bfloat16 = fp32                 16         1x2                  2x1                  1x1
bfloat16 x bfloat16 = fp32                 1          4x16                 16x4 (sparse)        4x4
bfloat16 x cbfloat16 = fp32                1          2x8                  8x2                  2x2
cbfloat16 x bfloat16 = fp32                1          2x8                  8x2                  2x2
cbfloat16 x cbfloat16 = fp32               1          2x8                  8x2                  2x2
cbfloat16 x cbfloat16 = fp32               8          1x2                  2x1                  1x1
bfloat16 x cbfloat16 = fp32                8          1x2                  2x1                  1x1
cbfloat16 x bfloat16 = fp32                8          1x2                  2x1                  1x1

Element Arrangement for [operation]_conv_MxN{_Cch}{_conf} Intrinsic

Taking an example of v32acc32 mac_conv_4x4_8ch_conf (v64int8 a, v64int8 b, v32acc32 acc1, int zero_acc1, int shift16, int sub_mul, int sub_acc1):

Vector A arrangement

Arranged channel by channel, with padding (X) added at the end

a1, a2, a3, a4, a5, a6, a7, X,
a9, a10, a11, a12, a13, a14, a15, X,
...
a57, a58, a59, a60, a61, a62, a63, X

Each channel contains 7 valid values followed by 1 unused padding value ('X') that can take any value.

Vector B arrangement

Arrangement similar to that of A

b1, b2, b3, b4, Y, Y, Y, Y,
b9, b10, b11, b12, Y, Y, Y, Y,
...
b57, b58, b59, b60, Y, Y, Y, Y

Each channel contains 4 valid values, followed by 4 unused padding values (Y).

Vector C (output) arrangement

c1, c2, c3, c4, c5, c6, c7, c8, ..., c29, c30, c31, c32

where channel 1 output is c1, c2, c3, c4; channel 2 output is c5, c6, c7, c8; ...; channel 8 output is c29, c30, c31, c32.
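Putting the three arrangements together, a scalar Python model of the plain mac_conv_4x4_8ch behaviour could look as follows (a sketch under the layout described above; the _ref name and list-based lanes are illustrative, and the conf masks are omitted):

```python
# Scalar model of mac_conv_4x4_8ch: for each of the 8 channels, a 4x4
# convolution of 7 valid data values (lane 8 is padding) with a 4-tap
# kernel (lanes 5..8 are padding), accumulated into 4 outputs per channel.
def mac_conv_4x4_8ch_ref(a, b, acc):
    out = list(acc)                  # 32 lanes: 8 channels x 4 outputs
    for ch in range(8):
        F = a[ch * 8 : ch * 8 + 7]   # 7 valid data values of this channel
        G = b[ch * 8 : ch * 8 + 4]   # 4 valid kernel values of this channel
        for m in range(4):
            out[ch * 4 + m] += sum(G[u] * F[m + u] for u in range(4))
    return out
```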

Matrix multiplication intrinsics

We can summarize the MUL and MAC operations as follows:

MAC: res = acc_in1 + (X_vec x Y_vec)
MUL: res = (X_vec x Y_vec)

Here 'x' denotes the matrix multiplication operator. In the same way we can summarize MSC, NEGMUL, MACMUL, and the MAC/MSC variants that take an additional acc_in2 input:

MSC: res = acc_in1 - (X_vec x Y_vec)
NEGMUL: res = - (X_vec x Y_vec)
MACMUL: res = (zero_acc1 ? 0 : acc_in1) + (X_vec x Y_vec)
ADDMAC: res = acc_in1 + acc_in2 + (X_vec x Y_vec)
ADDMSC: res = acc_in1 + acc_in2 - (X_vec x Y_vec)
SUBMAC: res = acc_in1 - acc_in2 + (X_vec x Y_vec)
SUBMSC: res = acc_in1 - acc_in2 - (X_vec x Y_vec)
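Per output lane, all of these variants reduce to one combination rule. A scalar Python sketch (the mac_variant name and keyword flags are illustrative, mirroring the conf masks described further below):

```python
# P is one lane of the matrix-product term (X_vec x Y_vec); the flags
# mirror the conf masks: zeroing and negation of acc1/acc2 and of P.
def mac_variant(P, acc1=0, acc2=None, sub_mul=0, sub_acc1=0,
                sub_acc2=0, zero_acc1=0):
    a1 = 0 if zero_acc1 else (-acc1 if sub_acc1 else acc1)
    a2 = 0 if acc2 is None else (-acc2 if sub_acc2 else acc2)
    return a1 + a2 + (-P if sub_mul else P)
```

For example, MAC is mac_variant(P, acc1), MSC is mac_variant(P, acc1, sub_mul=1), and SUBMAC is mac_variant(P, acc1, acc2, sub_acc2=1).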

The convolve variants

The convolve variants of these intrinsics differ in that they apply a convolution product to the vectors instead of a matrix multiplication; the '*' operator denotes vector convolution. Here, X_vec is the matrix and Y_vec the kernel.

MAC: res = acc_in1 + (X_vec * Y_vec)
MUL: res = (X_vec * Y_vec)
MSC: res = acc_in1 - (X_vec * Y_vec)
NEGMUL: res = - (X_vec * Y_vec)
MACMUL: res = (zero_acc1 ? 0 : acc_in1) + (X_vec * Y_vec)
ADDMAC: res = acc_in1 + acc_in2 + (X_vec * Y_vec)
ADDMSC: res = acc_in1 + acc_in2 - (X_vec * Y_vec)
SUBMAC: res = acc_in1 - acc_in2 + (X_vec * Y_vec)
SUBMSC: res = acc_in1 - acc_in2 - (X_vec * Y_vec)

Zeroing, sign and negation masks

Some variants allow passing masks that are used to determine the sign, zeroing, and negation of vector or accumulator lanes. These masks are the following:

int sgn_x: Sign mask of matrix X. If it is one, matrix X is interpreted as signed; otherwise it is treated as unsigned.
int sgn_y: Sign mask of matrix Y. If it is one, matrix Y is interpreted as signed; otherwise it is treated as unsigned.
int zero_acc1: Zeroing of acc1. If it is one, acc1 is zeroed.
int zero_acc2: Zeroing of acc2. If it is one, acc2 is zeroed.
int sub_mul: Negation mask of the matrix multiplication result. If it is one, the result of the operation is negated.
int sub_acc1: Negation mask of acc1. If it is one, acc1 is negated.
int sub_acc2: Negation mask of acc2. If it is one, acc2 is negated.
int shift16: Shift mask of acc1. If a bit is set, the <<16 operation is executed on acc1.
int sub_mask: Negation mask of complex multiplications. Negates individual terms of a complex multiplication.

Complex multiplications require some terms to be negated in order to implement conjugation and multiplication by minus j. This is done through the sub_mask. The following examples show how this mask is used when two complex numbers, X and Y, are multiplied to produce an output O. For Multiply-accumulate of 16b x 16b complex integer datatypes, two complex products are post-added; their terms are indicated by the postfix 0/1:

O[re] = -1^sub_mask[0] * X[re0] * Y[re0] + -1^sub_mask[1] * X[im0] * Y[im0]
+ -1^sub_mask[2] * X[re1] * Y[re1] + -1^sub_mask[3] * X[im1] * Y[im1]
O[im] = -1^sub_mask[4] * X[re0] * Y[im0] + -1^sub_mask[5] * X[im0] * Y[re0]
+ -1^sub_mask[6] * X[re1] * Y[im1] + -1^sub_mask[7] * X[im1] * Y[re1]

For Multiply-accumulate of 32b x 16b complex integer datatypes there is no post-adding and only four unique terms are needed. However, all 8 bits must be specified appropriately: in the following equations, the two index bits used for one term must have the same value.

O[re] = -1^sub_mask[0|2] * X[re] * Y[re] + -1^sub_mask[1|3] * X[im] * Y[im]
O[im] = -1^sub_mask[4|6] * X[re] * Y[im] + -1^sub_mask[5|7] * X[im] * Y[re]
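A scalar Python sketch of these four-term equations (the cmul_conf helper is hypothetical; it reads bits 0, 1, 4, and 5 only, on the understanding that the paired bits carry the same value):

```python
# Negation-controlled complex multiply: each of the four product terms
# re*re, im*im, re*im, im*re is negated when its sub_mask bit is set.
def cmul_conf(x, y, sub_mask):
    s = lambda b: -1 if (sub_mask >> b) & 1 else 1
    re = s(0) * x.real * y.real + s(1) * x.imag * y.imag
    im = s(4) * x.real * y.imag + s(5) * x.imag * y.real
    return complex(re, im)
```

With bits 1|3 set (mask 0b00001010) this reproduces the ordinary complex product; with bits 4|6 set (mask 0b01010000) it computes X times the conjugate of Y.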

Multiplication of matrices with multiple channels

Some intrinsics are used for multiplications of matrices with a given number of channels. Each MxN matrix is stored in row-major and channel-minor fashion. The following example shows the resulting layout of elements in the vector for a 4x4 matrix with two channels. The indices for each element are given as (m,n,c):

[a(0,0,0) a(0,0,1) a(0,1,0) a(0,1,1) a(0,2,0) a(0,2,1) a(0,3,0) a(0,3,1)
a(1,0,0) a(1,0,1) a(1,1,0) a(1,1,1) a(1,2,0) a(1,2,1) a(1,3,0) a(1,3,1)
a(2,0,0) a(2,0,1) a(2,1,0) a(2,1,1) a(2,2,0) a(2,2,1) a(2,3,0) a(2,3,1)
a(3,0,0) a(3,0,1) a(3,1,0) a(3,1,1) a(3,2,0) a(3,2,1) a(3,3,0) a(3,3,1)]
Note
Matrices with multiple channels are used for convolutional and element-wise operations. Element-wise operations are performed along the channels. For example, an element-wise multiplication of two matrices with 32 channels performs a matrix multiplication for each individual channel. The output again has 32 channels.
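The row-major, channel-minor rule above means the vector lane of element (m, n, c) is simply (m*N + n)*C + c, which can be checked against the example layout (the lane helper name is illustrative):

```python
# Vector lane of element (m, n, c) in an MxN matrix with C channels,
# stored row-major and channel-minor.
def lane(m, n, c, N, C):
    return (m * N + n) * C + c
```

For the 4x4 two-channel example, a(1,2,0) lands at lane 12 and a(3,3,1) at lane 31, matching the listing above.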

Element-wise multiplication

The elem variants allow you to perform element-wise operations. The operations are performed along the channels. For example, if you perform a (1x1x32) x (1x1x32) operation, a multiplication is done between the elements of the same channel: the elements of channel zero are multiplied, the elements of channel one are multiplied, and so on. The end result again has 32 channels.

Some of the elem variants perform matrix multiplications along the channels. For those cases the multiplication (1x2xC) x (2x1xC) is performed, and the end result is a (1x1xC) matrix. Despite the name, this is not a true element-wise multiplication.
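A scalar Python sketch of this per-channel (1x2) x (2x1) product, using the channel-minor layout described above (the _ref name is illustrative):

```python
# Per channel c: dot product of a 1x2 row with a 2x1 column.
# Both operands are channel-minor, so element (k, c) sits at lane k*C + c.
def mul_elem_C_2_ref(a, b, C):
    return [a[0 * C + c] * b[0 * C + c] + a[1 * C + c] * b[1 * C + c]
            for c in range(C)]
```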

Convolution operation

Convolutional operations work similarly to element-wise multiplication. In every step the kernel is multiplied with the matrix before being shifted to the next position. The same is done for each channel. The difference from a regular element-wise multiplication is that after the multiplications for each channel have been completed, the resulting matrices are added together, so the final result has only one channel.

Considerations when using bfloat16 data type

When multiplying with a scalar bfloat16, the scalar is internally cast to float, which influences the rounding behaviour under negation. Because the cast involves a rounding operation, it matters whether the negation is performed before or after the cast. The following example shows how this behaviour affects the multiplication: in the first case, the rounding is applied to the positive result before the negation; in the second and third cases, the negation happens first, which can lead to a different result.

bfloat16 a, b;
auto v1 = -(a * v[0]); //This will not match the other operations because the rounding is done to the positive result before negation
auto v2 = (-a * v[0]);
auto v3 = (a * -v[0]);

Considerations when using emulated FP32 Intrinsics

Element-wise multiplication and matrix multiplication intrinsics for the FP32 input type are emulated using the bfloat16 data path. There are three options to choose from.

Default option (most accurate but slow):

### _accuracy_safe intrinsics
Most accurate option: the input fp32 number is split into 3 bfloat16 numbers to extract all the bits of the mantissa.
float a, b;
a*b requires 9 mac operations, due to the 3 bfloat16 splits of each operand.

Fast and accurate option:

### _accuracy_fast intrinsics
Enabled with the application compile-time flag "AIE2_FP32_EMULATION_ACCURACY_FAST".
The input fp32 number is split into 3 bfloat16 numbers to extract all the bits of the mantissa, as in the safe option, but the mac operations involving only LSBs (the 3 last of the 9 terms) are ignored. This improves the cycle count of the multiplication with the least impact on the accuracy of the result.
float a, b;
a*b requires 6 mac operations.

Fastest option, with loss of accuracy:

### _accuracy_low intrinsics
Enabled with the application compile-time flag "AIE2_FP32_EMULATION_ACCURACY_LOW".
The input fp32 number is split into 2 bfloat16 numbers, so not all bits of the mantissa can be used. Of the resulting 4 mac operations, the one involving only LSBs (the last term) is ignored, which further improves the cycle count.
float a, b;
a*b requires 3 mac operations.
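The bfloat16 splitting these options rely on can be sketched in Python (truncation to bfloat16 is used here purely for illustration; the function names and the hardware rounding mode are assumptions):

```python
import struct

def to_bf16(x):
    # Keep only the top 16 bits of the float32 representation (truncation).
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split3(x):
    # Split a float32 into three bfloat16 terms covering the whole mantissa.
    hi = to_bf16(x)
    mid = to_bf16(x - hi)
    lo = to_bf16(x - hi - mid)
    return hi, mid, lo

def emul_mul_accuracy_safe(a, b):
    # All 9 partial products, as in the _accuracy_safe scheme; the _fast
    # scheme would drop the 3 smallest terms, and _low uses 2-way splits.
    return sum(ai * bi for ai in split3(a) for bi in split3(b))
```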

Modules

 Emulated Multiply-accumulate of 16-bit Complex Brain Floating-Point
 Elementwise matrix multiplications emulated on top of bfloat16.
 
 Emulated Multiply-accumulate of 16b x 32b datatypes
 Matrix multiplications in which matrix A has data elements of 16 bit and matrix B has data elements of 32 bit. These operations are emulated on top of Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of 32b x 16b datatypes
 Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 16 bit. These operations are emulated on top of Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of 32b x 32b datatypes
 Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 32 bit. These operations are emulated on top of Multiply-accumulate of 32b x 16b integer datatypes and Multiply-accumulate of 16b x 16b integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of Complex 32b x Complex 32b datatypes
 Matrix multiplications in which matrix A has data elements of complex 32 bit and matrix B has data elements of complex 32 bit. These operations are emulated on top of Multiply-accumulate of 32b x 16b complex integer datatypes and might not have optimal performance.
 
 Emulated Multiply-accumulate of Complex Float and Float datatypes
 Elementwise operations based on the already emulated FP32 operations (see intr_gpvectorop_mul_emul_float). These operations might not have optimal performance.
 
 Emulated Multiply-accumulate of fp32 x fp32 datatypes
 Elementwise multiplication and matrix multiplication using the bfloat16 data path. Two options are available: with or without set_rnd(0) for truncation before using these intrinsics. Use the AIE_FP32_EMULATION_SET_RND_MODE flag to set the rounding mode to truncation. For an explanation of how these operations work, see Multiply Accumulate.
 
 Multiply-accumulate of 16b x 16b complex integer datatypes
 Matrix multiplications in which matrix A and matrix B have complex data elements of 16 bit. For an explanation of how these operations work, see Multiply Accumulate.
 
 Multiply-accumulate of 16b x 16b integer datatypes
 Matrix multiplications in which matrix A and matrix B have data elements of 16 bit.
 
 Multiply-accumulate of 16b x 8b integer datatypes
 Matrix multiplications in which matrix A has data elements of 16 bit and matrix B has data elements of 8 bit.
 
 Multiply-accumulate of 32b x 16b complex integer datatypes
 Matrix multiplications in which matrix A has complex data elements of 32 bit and matrix B has complex data elements of 16 bit.
 
 Multiply-accumulate of 32b x 16b integer datatypes
 Matrix multiplications in which matrix A has data elements of 32 bit and matrix B has data elements of 16 bit.
 
 Multiply-accumulate of 8b x 4b datatypes
 Matrix multiplications in which matrix A has data elements of 8 bit and matrix B has data elements of 4 bit. These operations are emulated on top of int8 x int8.
 
 Multiply-accumulate of 8b x 8b integer datatypes
 Matrix multiplications in which matrix A and matrix B have data elements of 8 bit.
 
 Multiply-accumulate of bfloat16 datatypes
 Matrix multiplications in which matrix A and B have bfloat16 data elements.
 
 Multiply-accumulate with a sparse matrix
 Matrix multiplications in which matrix B is a sparse matrix.
 
 Negation control in complex multiplication modes
 In order to do complex multiplications, some terms need to be negated.
 

Functions

_INLINE v8caccfloat mac_elem_8_conf (v8cfloat v1, v8float v2, v8caccfloat acc, int zero_acc, int sub_mask, int sub_mul, int sub_acc1)
 
_INLINE v8caccfloat msc_elem_8_conf (v8cfloat v1, v8float v2, v8caccfloat acc, int zero_acc, int sub_mask, int sub_mul, int sub_acc1)
 
_INLINE v8caccfloat mul_elem_8_conf (v8cfloat v1, v8float v2, int sub_mask, int sub_mul)
 
_INLINE v8caccfloat negmul_elem_8_conf (v8cfloat v1, v8float v2, int sub_mask, int sub_mul)
 

Channel by channel multiplication of (1x2) with (2x1)  

v16accfloat mul_elem_16_2 (v32bfloat16 a, v32bfloat16 b)
 
v16accfloat negmul_elem_16_2 (v32bfloat16 a, v32bfloat16 b)
 
v16accfloat mac_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1)
 
v16accfloat msc_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1)
 
v16accfloat addmac_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2)
 
v16accfloat addmsc_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2)
 

Channel by channel multiplication of (1x2) with (2x1) with dynamic negation of multiplication result

v16accfloat mul_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, int sub_mul)
 
v16accfloat negmul_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, int sub_mul)
 

Channel by channel multiplication of (1x2) with (2x1) with dynamic negation of multiplication result, zeroing of acc1, and negation of acc1

v16accfloat mac_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, int zero_acc1, int sub_mul, int sub_acc1)
 
v16accfloat msc_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, int zero_acc1, int sub_mul, int sub_acc1)
 
v16accfloat addmac_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2, int zero_acc1, int sub_mul, int sub_acc1, int sub_acc2)
 
v16accfloat addmsc_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2, int zero_acc1, int sub_mul, int sub_acc1, int sub_acc2)
 

Function Documentation

◆ addmac_elem_16_2()

v16accfloat addmac_elem_16_2 ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1,
v16accfloat  acc2 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
acc2       Accumulator 2 input
Returns
Result of operation

◆ addmac_elem_16_2_conf()

v16accfloat addmac_elem_16_2_conf ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1,
v16accfloat  acc2,
int  zero_acc1,
int  sub_mul,
int  sub_acc1,
int  sub_acc2 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
acc2       Accumulator 2 input
zero_acc1  Zeroing mask for acc1
sub_mul    Negation mask of multiplication result
sub_acc1   Negation mask of acc1
sub_acc2   Negation mask of acc2
Returns
Result of operation

◆ addmsc_elem_16_2()

v16accfloat addmsc_elem_16_2 ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1,
v16accfloat  acc2 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
acc2       Accumulator 2 input
Returns
Result of operation

◆ addmsc_elem_16_2_conf()

v16accfloat addmsc_elem_16_2_conf ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1,
v16accfloat  acc2,
int  zero_acc1,
int  sub_mul,
int  sub_acc1,
int  sub_acc2 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
acc2       Accumulator 2 input
zero_acc1  Zeroing mask for acc1
sub_mul    Negation mask of multiplication result
sub_acc1   Negation mask of acc1
sub_acc2   Negation mask of acc2
Returns
Result of operation

◆ mac_elem_16_2()

v16accfloat mac_elem_16_2 ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
Returns
Result of operation

◆ mac_elem_16_2_conf()

v16accfloat mac_elem_16_2_conf ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1,
int  zero_acc1,
int  sub_mul,
int  sub_acc1 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
zero_acc1  Zeroing mask for acc1
sub_mul    Negation mask of multiplication result
sub_acc1   Negation mask of acc1
Returns
Result of operation

◆ mac_elem_8_conf()

_INLINE v8caccfloat mac_elem_8_conf ( v8cfloat  v1,
v8float  v2,
v8caccfloat  acc,
int  zero_acc,
int  sub_mask,
int  sub_mul,
int  sub_acc1 
)
Parameters
v1         Matrix A
v2         Matrix B
acc        Accumulator 1 input
zero_acc   Zeroing mask for acc
sub_mask   Negation mask of complex multiplications
sub_mul    Negation mask of multiplication result
sub_acc1   Negation mask of acc1
Returns
Result of operation

◆ msc_elem_16_2()

v16accfloat msc_elem_16_2 ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
Returns
Result of operation

◆ msc_elem_16_2_conf()

v16accfloat msc_elem_16_2_conf ( v32bfloat16  a,
v32bfloat16  b,
v16accfloat  acc1,
int  zero_acc1,
int  sub_mul,
int  sub_acc1 
)
Parameters
a          Matrix A
b          Matrix B
acc1       Accumulator 1 input
zero_acc1  Zeroing mask for acc1
sub_mul    Negation mask of multiplication result
sub_acc1   Negation mask of acc1
Returns
Result of operation

◆ msc_elem_8_conf()

_INLINE v8caccfloat msc_elem_8_conf ( v8cfloat  v1,
v8float  v2,
v8caccfloat  acc,
int  zero_acc,
int  sub_mask,
int  sub_mul,
int  sub_acc1 
)
Parameters
v1         Matrix A
v2         Matrix B
acc        Accumulator 1 input
zero_acc   Zeroing mask for acc
sub_mask   Negation mask of complex multiplications
sub_mul    Negation mask of multiplication result
sub_acc1   Negation mask of acc1
Returns
Result of operation

◆ mul_elem_16_2()

v16accfloat mul_elem_16_2 ( v32bfloat16  a,
v32bfloat16  b 
)
Parameters
a          Matrix A
b          Matrix B
Returns
Result of operation

◆ mul_elem_16_2_conf()

v16accfloat mul_elem_16_2_conf ( v32bfloat16  a,
v32bfloat16  b,
int  sub_mul 
)
Parameters
a          Matrix A
b          Matrix B
sub_mul    Negation mask for multiplication result
Returns
Result of operation

◆ mul_elem_8_conf()

_INLINE v8caccfloat mul_elem_8_conf ( v8cfloat  v1,
v8float  v2,
int  sub_mask,
int  sub_mul 
)
Parameters
v1         Matrix A
v2         Matrix B
sub_mask   Negation mask of complex multiplications
sub_mul    Negation mask of multiplication result
Returns
Result of operation

◆ negmul_elem_16_2()

v16accfloat negmul_elem_16_2 ( v32bfloat16  a,
v32bfloat16  b 
)
Parameters
a          Matrix A
b          Matrix B
Returns
Result of operation

◆ negmul_elem_16_2_conf()

v16accfloat negmul_elem_16_2_conf ( v32bfloat16  a,
v32bfloat16  b,
int  sub_mul 
)
Parameters
a          Matrix A
b          Matrix B
sub_mul    Negation mask for multiplication result. If a bit of sub_mul is set, the corresponding vector lane of the output accumulator will be negated.
Returns
Result of operation

◆ negmul_elem_8_conf()

_INLINE v8caccfloat negmul_elem_8_conf ( v8cfloat  v1,
v8float  v2,
int  sub_mask,
int  sub_mul 
)
Parameters
v1         Matrix A
v2         Matrix B
sub_mask   Negation mask of complex multiplications
sub_mul    Negation mask of multiplication result
Returns
Result of operation