Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.
For integer datatypes, a matrix A of size MxN is multiplied with a matrix B of size NxP. The naming convention for these operations is: [operation][_MxN_NxP]{_Cch}{_conf} or [operation]_conv_MxN{_Cch}{_conf}. Properties in [] are mandatory, properties in {} are optional. In this naming, conv indicates a convolutional operation, conf indicates the use of sub, zero or shift masks and C gives the number of channels.
For an MxN vector multiply convolution operation, the calculation performed for each output index \(x = 0, \dots, \text{M}-1\) is:
\[
\text{mul_conv_MxN}(F,G)(x) = \sum_{u=0}^{\text{N}-1}{G(u) F(x+u)}
\]
where the vector \(F\) has length \(\text{M}+\text{N}-1\), and the vector \(G\) has length \(\text{N}\).
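As a worked instance of the formula above, take M = N = 4 (chosen here only for illustration). Then \(F\) has 7 entries, \(G\) has 4, and the sum expands to four outputs. The name \(\text{out}(x)\) is shorthand introduced for this example and is not part of the intrinsic naming.
\[
\begin{aligned}
\text{out}(0) &= G(0)F(0) + G(1)F(1) + G(2)F(2) + G(3)F(3) \\
\text{out}(1) &= G(0)F(1) + G(1)F(2) + G(2)F(3) + G(3)F(4) \\
\text{out}(2) &= G(0)F(2) + G(1)F(3) + G(2)F(4) + G(3)F(5) \\
\text{out}(3) &= G(0)F(3) + G(1)F(4) + G(2)F(5) + G(3)F(6)
\end{aligned}
\]
The highest index of \(F\) that is read is \(\text{M}+\text{N}-2 = 6\), which is why \(F\) needs \(\text{M}+\text{N}-1\) entries.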
For element-wise operations, the naming is [operation_elem_C]{_N}. Here, C is the number of channels and N is the number of columns of matrix A/rows of matrix B. N is either two or omitted. The element-wise operations are executed channel by channel. The output is also a matrix with C channels.
For complex datatypes, a multiplication of two matrices with complex elements is performed. The naming convention for these operations is [operation_elem_8]{_conf} for multiply-accumulate of 32b x 16b complex integer datatypes and [operation_elem_8_2]{_conf} for multiply-accumulate of 16b x 16b complex integer datatypes. Here, eight is the number of channels and two is the number of columns of matrix A/rows of matrix B. The matrix multiplication is performed individually for each channel of the input matrices. The output is also a matrix with eight channels.
The following table shows the matrix multiplications that can be completed within a single cycle.
Precision Mode | Channels | Matrix A | Matrix B | Matrix C |
8-bit x 4-bit = 32-bit | 1 | 4x16 | 16x8 | 4x8 |
8-bit x 4-bit = 32-bit | 1 | 4x32 | 32x8 (sparse) | 4x8 |
8-bit x 8-bit = 32-bit | 1 | 4x8 | 8x8 | 4x8 |
8-bit x 8-bit = 32-bit | 32 | 1x2 | 2x1 | 1x1 |
8-bit x 8-bit = 32-bit | 8 | 4x4 (convolution) | 4x1 | 4x1 |
8-bit x 8-bit = 32-bit | 4 | 8x8 (convolution) | 8x1 | 8x1 |
8-bit x 8-bit = 32-bit | 1 | 32x8 (convolution) | 8x1 | 32x1 |
8-bit x 8-bit = 32-bit | 1 | 4x16 | 16x8 (sparse) | 4x8 |
16-bit x 8-bit = 32-bit | 1 | 4x4 | 4x8 | 4x8 |
16-bit x 8-bit = 32-bit | 2 | 4x4 | 4x4 | 4x4 |
16-bit x 16-bit = 32-bit | 1 | 4x2 | 2x8 | 4x8 |
16-bit x 16-bit = 32-bit | 32 | 1x1 | 1x1 | 1x1 |
16-bit x 8-bit = 64-bit | 1 | 2x8 | 8x8 | 2x8 |
16-bit x 8-bit = 64-bit | 1 | 4x8 | 8x4 | 4x4 |
16-bit x 8-bit = 64-bit | 1 | 2x16 | 16x8 (sparse) | 2x8 |
16-bit x 16-bit = 64-bit | 1 | 2x4 | 4x8 | 2x8 |
16-bit x 16-bit = 64-bit | 1 | 4x4 | 4x4 | 4x4 |
16-bit x 16-bit = 64-bit | 16 | 1x2 | 2x1 | 1x1 |
16-bit x 16-bit = 64-bit | 1 | 16x4 (convolution) | 4x1 | 16x1 |
Complex 16-bit x Complex 16-bit = 64-bit | 8 | 1x2 | 2x1 | 1x1 |
16-bit x 16-bit = 64-bit | 1 | 2x8 | 8x8 (sparse) | 2x8 |
32-bit x 16-bit = 64-bit | 1 | 4x2 | 2x4 | 4x4 |
Complex 32-bit x Complex 16-bit = 64-bit | 8 | 1x1 | 1x1 | 1x1 |
bfloat16 x bfloat16 = fp32 | 1 | 4x8 | 8x4 | 4x4 |
bfloat16 x bfloat16 = fp32 | 16 | 1x2 | 2x1 | 1x1 |
bfloat16 x bfloat16 = fp32 | 1 | 4x16 | 16x4 (sparse) | 4x4 |
bfloat16 x cbfloat16 = fp32 | 1 | 2x8 | 8x2 | 2x2 |
cbfloat16 x bfloat16 = fp32 | 1 | 2x8 | 8x2 | 2x2 |
cbfloat16 x cbfloat16 = fp32 | 1 | 2x8 | 8x2 | 2x2 |
cbfloat16 x cbfloat16 = fp32 | 8 | 1x2 | 2x1 | 1x1 |
bfloat16 x cbfloat16 = fp32 | 8 | 1x2 | 2x1 | 1x1 |
cbfloat16 x bfloat16 = fp32 | 8 | 1x2 | 2x1 | 1x1 |
Element Arrangement for [operation]_conv_MxN{_Cch}{_conf} Intrinsic
Taking an example of v32acc32 mac_conv_4x4_8ch_conf (v64int8 a, v64int8 b, v32acc32 acc1, int zero_acc1, int shift16, int sub_mul, int sub_acc1):
Vector A arrangement
Arranged channel by channel, with padding (X) added at the end
a1, a2, a3, a4, a5, a6, a7, X,
a9, a10, a11, a12, a13, a14, a15, X,
...
a57, a58, a59, a60, a61, a62, a63, X
Each channel contains 7 valid values followed by 1 unused padding value (X), which can take any value.
Vector B arrangement
Similar arrangement as A
b1, b2, b3, b4, Y, Y, Y, Y,
b9, b10, b11, b12, Y, Y, Y, Y,
...
b57, b58, b59, b60, Y, Y, Y, Y
Each channel contains 4 valid values, followed by 4 unused padding values (Y).
Vector C(output) arrangement
c1, c2, c3, c4, c5, c6, c7, c8, ..., c29, c30, c31, c32
where the channel 1 output is c1, c2, c3, c4; the channel 2 output is c5, c6, c7, c8; ...; and the channel 8 output is c29, c30, c31, c32.
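The arrangement above can be sketched as a host-side reference model. This is not the intrinsic: `mul_conv_4x4_8ch_model` and the `k...` constants are hypothetical names, the accumulator inputs and `_conf` flags are left out, and the code assumes each channel is convolved independently, as the output arrangement suggests.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout assumed from the description above: 8 channels, each owning an
// 8-element slot in a (7 valid + 1 padding) and in b (4 valid + 4 padding).
constexpr std::size_t kChannels = 8;
constexpr std::size_t kSlot = 8;   // elements per channel in a and b
constexpr std::size_t kOut = 4;    // outputs per channel
constexpr std::size_t kTaps = 4;   // valid kernel values per channel

std::array<std::int32_t, kChannels * kOut>
mul_conv_4x4_8ch_model(const std::array<std::int8_t, kChannels * kSlot>& a,
                       const std::array<std::int8_t, kChannels * kSlot>& b) {
    std::array<std::int32_t, kChannels * kOut> c{};
    for (std::size_t ch = 0; ch < kChannels; ++ch) {
        for (std::size_t x = 0; x < kOut; ++x) {
            std::int32_t sum = 0;
            for (std::size_t u = 0; u < kTaps; ++u) {
                // x + u is at most 6, so the padding element of a is never read.
                assert(x + u < kSlot - 1);
                sum += static_cast<std::int32_t>(a[ch * kSlot + x + u]) *
                       static_cast<std::int32_t>(b[ch * kSlot + u]);
            }
            c[ch * kOut + x] = sum;  // 4 consecutive outputs per channel
        }
    }
    return c;
}
```

The padding positions are skipped entirely, which matches the statement that X can take any value.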
Matrix mult intrinsics
We can summarize the MUL and the MAC operation like this:
MAC: res = acc_in1 + (X_vec x Y_vec)
MUL: res = (X_vec x Y_vec)
Here 'x' is the matrix multiplication operator. In the same way, the MSC, NEGMUL and MACMUL operations, and the MAC/MSC variants with an additional acc_in2 input, can be summarized as:
MSC: res = acc_in1 - (X_vec x Y_vec)
NEGMUL: res = - (X_vec x Y_vec)
MACMUL: res = (zero_acc1 ? 0 : acc_in1) + (X_vec x Y_vec)
ADDMAC: res = acc_in1 + acc_in2 + (X_vec x Y_vec)
ADDMSC: res = acc_in1 + acc_in2 - (X_vec x Y_vec)
SUBMAC: res = acc_in1 - acc_in2 + (X_vec x Y_vec)
SUBMSC: res = acc_in1 - acc_in2 - (X_vec x Y_vec)
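The summaries above can be restated as a small host-side model. It is a sketch for checking expectations, not how the intrinsics are implemented: `Mat`, `matmul`, `Op` and `combine` are hypothetical names, matrices are assumed to be stored row-major, and MACMUL is omitted because it needs the zero_acc1 flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<long long>;  // row-major storage (assumption)

// Plain matrix product: (M x N) times (N x P) gives (M x P).
Mat matmul(const Mat& x, const Mat& y, std::size_t M, std::size_t N, std::size_t P) {
    assert(x.size() == M * N && y.size() == N * P);
    Mat r(M * P, 0);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t p = 0; p < P; ++p)
            for (std::size_t n = 0; n < N; ++n)
                r[m * P + p] += x[m * N + n] * y[n * P + p];
    return r;
}

enum class Op { MUL, NEGMUL, MAC, MSC, ADDMAC, ADDMSC, SUBMAC, SUBMSC };

// Applies the per-element formula of each operation to a finished product.
Mat combine(Op op, const Mat& prod, const Mat& acc1, const Mat& acc2) {
    assert(acc1.size() == prod.size() && acc2.size() == prod.size());
    Mat res(prod.size());
    for (std::size_t i = 0; i < prod.size(); ++i) {
        const long long p = prod[i], a1 = acc1[i], a2 = acc2[i];
        switch (op) {
            case Op::MUL:    res[i] = p;            break;
            case Op::NEGMUL: res[i] = -p;           break;
            case Op::MAC:    res[i] = a1 + p;       break;
            case Op::MSC:    res[i] = a1 - p;       break;
            case Op::ADDMAC: res[i] = a1 + a2 + p;  break;
            case Op::ADDMSC: res[i] = a1 + a2 - p;  break;
            case Op::SUBMAC: res[i] = a1 - a2 + p;  break;
            case Op::SUBMSC: res[i] = a1 - a2 - p;  break;
        }
    }
    return res;
}
```

For the operations that take no acc_in2 (or no accumulator at all), the unused arguments are simply ignored by the model.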
The convolve variants
The convolve variants of these intrinsics differ in that they apply a convolution to the vectors instead of a matrix multiplication. Here '*' is the vector convolution operator, X_vec is the matrix and Y_vec is the kernel.
MAC: res = acc_in1 + (X_vec * Y_vec)
MUL: res = (X_vec * Y_vec)
MSC: res = acc_in1 - (X_vec * Y_vec)
NEGMUL: res = - (X_vec * Y_vec)
MACMUL: res = (zero_acc1 ? 0 : acc_in1) + (X_vec * Y_vec)
ADDMAC: res = acc_in1 + acc_in2 + (X_vec * Y_vec)
ADDMSC: res = acc_in1 + acc_in2 - (X_vec * Y_vec)
SUBMAC: res = acc_in1 - acc_in2 + (X_vec * Y_vec)
SUBMSC: res = acc_in1 - acc_in2 - (X_vec * Y_vec)
Zeroing, sign and negation masks
Some variants accept masks that determine the sign, zeroing and negation of vector or accumulator lanes. These masks are the following:
int sgn_x: Sign mask of matrix X. If it is one, matrix X is interpreted as signed; otherwise it is treated as unsigned.
int sgn_y: Sign mask of matrix Y. If it is one, matrix Y is interpreted as signed; otherwise it is treated as unsigned.
int zero_acc1: Zeroing of acc1. If it is one then acc1 is zeroed.
int zero_acc2: Zeroing of acc2. If it is one then acc2 is zeroed.
int sub_mul: Negation mask of the matrix multiplication result. If it is one the result of the operation will be negated.
int sub_acc1: Negation mask of acc1. If it is one acc1 will be negated.
int sub_acc2: Negation mask of acc2. If it is one acc2 will be negated.
int shift16: Shift mask of acc1. If a bit is set the <<16 operation will be executed on acc1.
int sub_mask: Negation mask of complex multiplications. Negates a term of a complex multiplication.
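A rough single-lane sketch of how the accumulator flags listed above might combine is given here. The order of operations (zero, then shift, then negate) is an assumption made for this example, and `mac_conf_lane` and `is_flag` are hypothetical names; the sketch only restates the flag descriptions and does not describe the hardware datapath.

```cpp
#include <cassert>
#include <cstdint>

inline bool is_flag(int v) { return v == 0 || v == 1; }

// One accumulator lane, with each flag reduced to a single bit.
// prod is the already computed multiplication result for that lane.
std::int64_t mac_conf_lane(std::int64_t acc1, std::int64_t prod,
                           int zero_acc1, int shift16, int sub_mul, int sub_acc1) {
    assert(is_flag(zero_acc1) && is_flag(shift16) && is_flag(sub_mul) && is_flag(sub_acc1));
    std::int64_t a = zero_acc1 ? 0 : acc1;  // zero_acc1: acc1 is zeroed
    if (shift16) a *= 65536;                // shift16: acc1 << 16
    if (sub_acc1) a = -a;                   // sub_acc1: acc1 is negated
    const std::int64_t p = sub_mul ? -prod : prod;  // sub_mul: product is negated
    return a + p;
}
```

Under these assumptions, sub_mul = 1 alone reproduces the MSC formula, and zero_acc1 = 1 together with sub_mul = 1 reproduces NEGMUL.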
Complex multiplications require some terms to be negated in order to implement conjugation and minus j multiplication. This is done through the sub_mask. The following examples show how this mask is used when two complex numbers, X and Y, are multiplied to get an output O. For Multiply-accumulate of 16b x 16b complex integer datatypes there are two complex numbers post-added. They are indicated by the postfix 0/1:
O[re] = (-1)^sub_mask[0] * X[re0] * Y[re0] + (-1)^sub_mask[1] * X[im0] * Y[im0]
+ (-1)^sub_mask[2] * X[re1] * Y[re1] + (-1)^sub_mask[3] * X[im1] * Y[im1]
O[im] = (-1)^sub_mask[4] * X[re0] * Y[im0] + (-1)^sub_mask[5] * X[im0] * Y[re0]
+ (-1)^sub_mask[6] * X[re1] * Y[im1] + (-1)^sub_mask[7] * X[im1] * Y[re1]
For multiply-accumulate of 32b x 16b complex integer datatypes there is no post-adding and only four unique terms are needed. However, all 8 bits must still be specified appropriately: in the following equations, the two index bits listed for a term must have the same value.
O[re] = (-1)^sub_mask[0|2] * X[re] * Y[re] + (-1)^sub_mask[1|3] * X[im] * Y[im]
O[im] = (-1)^sub_mask[4|6] * X[re] * Y[im] + (-1)^sub_mask[5|7] * X[im] * Y[re]
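To make the role of sub_mask concrete, the 16b x 16b equations can be written as a small host-side model. `Cplx`, `sgn` and `cmul2_model` are hypothetical names, and the mask values in the comments are derived from the equations above rather than taken from the intrinsic headers.

```cpp
#include <cassert>
#include <cstdint>

struct Cplx { std::int64_t re, im; };

// (-1)^mask[bit]: +1 when the bit is clear, -1 when it is set.
inline std::int64_t sgn(unsigned mask, int bit) {
    return ((mask >> bit) & 1u) ? -1 : 1;
}

// Two complex products (index 0 and 1) that are post-added into one output.
Cplx cmul2_model(Cplx x0, Cplx y0, Cplx x1, Cplx y1, unsigned m) {
    assert(m <= 0xFFu);  // only 8 mask bits are defined
    Cplx o;
    o.re = sgn(m, 0) * x0.re * y0.re + sgn(m, 1) * x0.im * y0.im
         + sgn(m, 2) * x1.re * y1.re + sgn(m, 3) * x1.im * y1.im;
    o.im = sgn(m, 4) * x0.re * y0.im + sgn(m, 5) * x0.im * y0.re
         + sgn(m, 6) * x1.re * y1.im + sgn(m, 7) * x1.im * y1.re;
    return o;
}

// Masks that follow from the equations:
//   0x0A (bits 1, 3): X * Y        since re = XreYre - XimYim
//   0x50 (bits 4, 6): X * conj(Y)  since im = -XreYim + XimYre
//   0xA0 (bits 5, 7): conj(X) * Y  since im = XreYim - XimYre
```

With sub_mask = 0 every term is added, which is not a valid complex product, so a non-zero mask is needed for an ordinary multiplication.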
Multiplication of matrices with multiple channels
Some intrinsics are used for multiplications of matrices with a given number of channels. Each MxN matrix is stored in row-major and channel-minor fashion. The following example shows the resulting layout of elements in the vector for a 4x4 matrix with two channels. The indices of each element are given as (m,n,c):
[a(0,0,0) a(0,0,1) a(0,1,0) a(0,1,1) a(0,2,0) a(0,2,1) a(0,3,0) a(0,3,1)
a(1,0,0) a(1,0,1) a(1,1,0) a(1,1,1) a(1,2,0) a(1,2,1) a(1,3,0) a(1,3,1)
a(2,0,0) a(2,0,1) a(2,1,0) a(2,1,1) a(2,2,0) a(2,2,1) a(2,3,0) a(2,3,1)
a(3,0,0) a(3,0,1) a(3,1,0) a(3,1,1) a(3,2,0) a(3,2,1) a(3,3,0) a(3,3,1)]
- Note
- Matrices with multiple channels are used for convolutional and element-wise operations. Element-wise operations are performed along the channels. For example, an element-wise multiplication of two matrices with 32 channels performs a matrix multiplication for each individual channel. The output again has 32 channels.
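The row-major, channel-minor layout shown above corresponds to a simple index computation. The helper below is a sketch with a hypothetical name (`flat_index`); it only restates the ordering of the 4x4, two-channel example.

```cpp
#include <cassert>
#include <cstddef>

// Position of element (m, n, c) in the flat vector for an M x N matrix
// with C channels: rows first, then columns, channels innermost.
inline std::size_t flat_index(std::size_t m, std::size_t n, std::size_t c,
                              std::size_t N, std::size_t C) {
    assert(n < N && c < C);
    return (m * N + n) * C + c;
}
```

For the example with N = 4 and C = 2, `flat_index(0, 0, 1, 4, 2)` is 1, `flat_index(0, 1, 0, 4, 2)` is 2 and `flat_index(1, 0, 0, 4, 2)` is 8, matching the order in which the elements are listed.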
Element-wise multiplication
The elem variants allow you to perform element-wise operations. The operations are performed along the channels. For example, if you perform a (1x1x32) x (1x1x32) operation, a multiplication is done between the elements of the same channel: the elements of channel zero are multiplied, the elements of channel one are multiplied, and so on. The end result again has 32 channels.
Some of the elem variants perform matrix multiplications along the channels. For those cases the multiplication (1x2xC) x (2x1xC) is performed. The end result is a (1x1xC) matrix. Despite the name, this is not a true element-wise multiplication.
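The (1x2xC) x (2x1xC) case amounts to a two-term dot product per channel. The sketch below assumes the row-major, channel-minor storage described in the multi-channel section, so element (0,n,c) of A and element (n,0,c) of B both sit at index n*C + c. `mul_elem_C_2_model` is a hypothetical name, and the element order expected by the actual intrinsics should be checked before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-channel product of a 1x2 row with a 2x1 column, for C channels.
std::vector<float> mul_elem_C_2_model(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t C) {
    assert(a.size() == 2 * C && b.size() == 2 * C);
    std::vector<float> out(C);
    for (std::size_t c = 0; c < C; ++c) {
        // a(0,0,c) * b(0,0,c) + a(0,1,c) * b(1,0,c)
        out[c] = a[0 * C + c] * b[0 * C + c] + a[1 * C + c] * b[1 * C + c];
    }
    return out;
}
```

Each output channel depends only on the inputs of the same channel, which is the sense in which the operation is performed along the channels.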
Convolution operation
Convolutional operations work similarly to element-wise multiplication. In every step the kernel is multiplied with the matrix before it is shifted to the next position. The same is done for each channel. The difference from a regular element-wise multiplication is that, after the multiplications for each channel have been completed, the resulting matrices are added together so that the final result has only one channel.
Considerations when using bfloat16 data type
When multiplying with a scalar bfloat16, the scalar is internally cast to float, which influences the rounding behaviour when a negation is involved. Because the cast involves a rounding operation, it matters whether the negation is performed before or after the cast. The following example shows how this affects the multiplication. In the first case, the positive result is rounded before it is negated. In the second and third cases, the negation is applied before the rounding, which will lead to a different result.
auto v1 = -(a * v[0]);
auto v2 = (-a * v[0]);
auto v3 = (a * -v[0]);
Considerations when using emulated FP32 Intrinsics
Element-wise multiplication and matrix multiplication intrinsics for the FP32 input type are emulated using the bfloat16 data path. There are 3 options to choose from. Default option (most accurate but slow):
### _accuracy_safe intrinsics
Most accurate option, since each input fp32 number is split into 3 bfloat16 numbers to extract all the bits of the mantissa.
float a, b;
a*b would require 9 mac operations due to the 3 bfloat16 splits of each operand.
Fast and accurate option:
### _accuracy_fast intrinsics
Application compile time flag "AIE2_FP32_EMULATION_ACCURACY_FAST": fast and accurate option.
Each input fp32 number is split into 3 bfloat16 numbers to extract all the bits of the mantissa.
float a, b;
Both a and b are split into 3 bfloat16 numbers each, so a full multiplication of a and b would take 9 mac operations.
Of the 9 mac operations needed to emulate the fp32 mul, the mac operations with LSBs (the 3 last terms) are ignored.
This improves the cycle count of the mul and has the least impact on the accuracy of the result.
float a, b;
a*b would require 6 mac operations.
Fastest option with loss of accuracy:
### _accuracy_low intrinsics
Application compile time flag "AIE2_FP32_EMULATION_ACCURACY_LOW": fast and least accurate option.
Each input fp32 number is split into 2 bfloat16 numbers, so not all the bits of the mantissa can be used.
float a, b;
Both a and b are split into 2 bfloat16 numbers each, so a full multiplication of a and b would take 4 mac operations.
Of the 4 mac operations needed to emulate the fp32 mul, the mac operation with LSBs (the 1 last term) is ignored.
This improves the cycle count of the mul.
float a, b;
a*b would require 3 mac operations.
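The three accuracy options can be illustrated with a host-side sketch of the splitting idea. Several things here are assumptions made for the example and are not taken from the intrinsic implementation: the split uses truncation to bfloat16, the dropped terms are taken to be the partial products of the highest order, and the partial products are accumulated in double. All function and type names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Keep only the top 16 bits of a float, i.e. a value representable in bfloat16.
inline float truncate_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Split x into Parts bfloat16-representable pieces, most significant first.
template <std::size_t Parts>
std::array<float, Parts> split_bf16(float x) {
    std::array<float, Parts> p{};
    float rest = x;
    for (std::size_t i = 0; i < Parts; ++i) {
        p[i] = truncate_to_bf16(rest);
        rest -= p[i];  // remainder still to be represented
    }
    return p;
}

struct EmulatedProduct { double value; int macs; };

// Sum the partial products pa[i] * pb[j] with i + j <= max_order and count
// how many multiply-accumulate steps that takes.
template <std::size_t Parts>
EmulatedProduct emulated_mul(float a, float b, std::size_t max_order) {
    assert(max_order <= 2 * (Parts - 1));
    const auto pa = split_bf16<Parts>(a);
    const auto pb = split_bf16<Parts>(b);
    EmulatedProduct r{0.0, 0};
    for (std::size_t i = 0; i < Parts; ++i)
        for (std::size_t j = 0; j < Parts; ++j)
            if (i + j <= max_order) {
                r.value += static_cast<double>(pa[i]) * static_cast<double>(pb[j]);
                ++r.macs;
            }
    return r;
}
```

In this model `emulated_mul<3>(a, b, 4)` uses 9 terms, `emulated_mul<3>(a, b, 2)` uses 6 and `emulated_mul<2>(a, b, 1)` uses 3, mirroring the mac counts quoted for the safe, fast and low options.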
_INLINE v8caccfloat | mac_elem_8_conf (v8cfloat v1, v8float v2, v8caccfloat acc, int zero_acc, int sub_mask, int sub_mul, int sub_acc1) |
_INLINE v8caccfloat | msc_elem_8_conf (v8cfloat v1, v8float v2, v8caccfloat acc, int zero_acc, int sub_mask, int sub_mul, int sub_acc1) |
_INLINE v8caccfloat | mul_elem_8_conf (v8cfloat v1, v8float v2, int sub_mask, int sub_mul) |
_INLINE v8caccfloat | negmul_elem_8_conf (v8cfloat v1, v8float v2, int sub_mask, int sub_mul) |

v16accfloat | mul_elem_16_2 (v32bfloat16 a, v32bfloat16 b) |
v16accfloat | negmul_elem_16_2 (v32bfloat16 a, v32bfloat16 b) |
v16accfloat | mac_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1) |
v16accfloat | msc_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1) |
v16accfloat | addmac_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2) |
v16accfloat | addmsc_elem_16_2 (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2) |

v16accfloat | mac_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, int zero_acc1, int sub_mul, int sub_acc1) |
v16accfloat | msc_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, int zero_acc1, int sub_mul, int sub_acc1) |
v16accfloat | addmac_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2, int zero_acc1, int sub_mul, int sub_acc1, int sub_acc2) |
v16accfloat | addmsc_elem_16_2_conf (v32bfloat16 a, v32bfloat16 b, v16accfloat acc1, v16accfloat acc2, int zero_acc1, int sub_mul, int sub_acc1, int sub_acc2) |
◆ addmac_elem_16_2()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
- Returns
- Result of operation
◆ addmac_elem_16_2_conf()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
zero_acc1 | Zeroing mask for acc1 |
sub_mul | Negation mask of multiplication result |
sub_acc1 | Negation mask of acc1 |
sub_acc2 | Negation mask of acc2 |
- Returns
- Result of operation
◆ addmsc_elem_16_2()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
- Returns
- Result of operation
◆ addmsc_elem_16_2_conf()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
acc2 | Accumulator 2 input |
zero_acc1 | Zeroing mask for acc1 |
sub_mul | Negation mask of multiplication result |
sub_acc1 | Negation mask of acc1 |
sub_acc2 | Negation mask of acc2 |
- Returns
- Result of operation
◆ mac_elem_16_2()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
- Returns
- Result of operation
◆ mac_elem_16_2_conf()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
zero_acc1 | Zeroing mask for acc1 |
sub_mul | Negation mask of multiplication result |
sub_acc1 | Negation mask of acc1 |
- Returns
- Result of operation
◆ mac_elem_8_conf()
- Parameters
-
v1 | Matrix A |
v2 | Matrix B |
acc | Accumulator 1 input |
- Returns
- Result of operation
◆ msc_elem_16_2()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
- Returns
- Result of operation
◆ msc_elem_16_2_conf()
- Parameters
-
a | Matrix A |
b | Matrix B |
acc1 | Accumulator 1 input |
zero_acc1 | Zeroing mask for acc1 |
sub_mul | Negation mask of multiplication result |
sub_acc1 | Negation mask of acc1 |
- Returns
- Result of operation
◆ msc_elem_8_conf()
- Parameters
-
v1 | Matrix A |
v2 | Matrix B |
acc | Accumulator 1 input |
- Returns
- Result of operation
◆ mul_elem_16_2()
- Parameters
-
a | Matrix A |
b | Matrix B |
- Returns
- Result of operation
◆ mul_elem_16_2_conf()
- Parameters
-
a | Matrix A |
b | Matrix B |
sub_mul | Negation mask for multiplication result |
- Returns
- Result of operation
◆ mul_elem_8_conf()
- Parameters
-
v1 | Matrix A |
v2 | Matrix B |
- Returns
- Result of operation
◆ negmul_elem_16_2()
- Parameters
-
a | Matrix A |
b | Matrix B |
- Returns
- Result of operation
◆ negmul_elem_16_2_conf()
- Parameters
-
a | Matrix A |
b | Matrix B |
sub_mul | Negation mask for multiplication result. If a bit of sub_mul is set the corresponding vector lane of the output accumulator will be negated. |
- Returns
- Result of operation
◆ negmul_elem_8_conf()
- Parameters
-
v1 | Matrix A |
v2 | Matrix B |
- Returns
- Result of operation