AI Engine-ML v2 Intrinsics User Guide
v2025.1
Intrinsics allowing you to perform MUL/MAC operations and a few of their variants.
For integer datatypes, a matrix A of size MxN is multiplied by a matrix B of size NxP. The naming convention for these operations is [operation][_MxN_NxP]{_Cch}{_conf} or [operation]_conv_MxN{_Cch}{_conf}. Properties in [] are mandatory; properties in {} are optional. In this naming, conv indicates a convolution operation, conf indicates the use of sub, zero or shift masks, and C gives the number of channels. For example, a name following the pattern mul_4x8_8x16 denotes the multiplication of a 4x8 matrix A by an 8x16 matrix B.
For an MxN vector multiply convolution operation, the calculation performed for each output lane \(x = 0, \dots, \text{M}-1\) is:
\[ \text{mul\_conv\_MxN}(F,G)(x) = \sum_{u=0}^{\text{N}-1}{G(u)\, F(x+u)} \]
where the vector \(F\) has length \(\text{M}+\text{N}-1\), and the vector \(G\) has length \(\text{N}\).
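As a reference, the following is a minimal scalar sketch of this computation in plain C++; it illustrates the math only, not the vector intrinsic, and the function and parameter names are ours:

```cpp
#include <array>
#include <cstdint>

// Scalar reference for the mul_conv_MxN computation defined above:
//   out(x) = sum over u of G(u) * F(x+u), for x = 0..M-1.
// F holds the M+N-1 input samples, G the N kernel taps. With int16_t
// inputs and int64_t accumulation this mirrors the
// 16-bit x 16-bit = 64-bit convolution modes listed in the table.
template <int M, int N>
std::array<std::int64_t, M> mul_conv(const std::array<std::int16_t, M + N - 1>& F,
                                     const std::array<std::int16_t, N>& G) {
    std::array<std::int64_t, M> out{};
    for (int x = 0; x < M; ++x)      // one output lane per position x
        for (int u = 0; u < N; ++u)  // accumulate the N kernel taps
            out[x] += std::int64_t(G[u]) * F[x + u];
    return out;
}
```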
For element-wise operations, the naming is [operation_elem_C]{_N}. Here, C is the number of channels and N is the number of columns of matrix A/rows of matrix B; N is either two or it is omitted. The element-wise operations are executed channel by channel, and the output is again a matrix with C channels.
For complex datatypes, a multiplication of two matrices with complex elements is performed. The naming convention for these operations is [operation_elem_8/16]{_conf} for multiply-accumulate of 32b x 16b complex integer datatypes and [operation_elem_8/8_2/16/16_2]{_conf} for multiply-accumulate of 16b x 16b complex integer datatypes. Here, eight is the number of channels and two is the number of columns of matrix A/rows of matrix B. The matrix multiplication is performed individually for each channel of the input matrices, and the output is again a matrix with eight channels.
The following table shows the matrix multiplications that can be completed within a single cycle.
Precision Mode | Channels | Matrix A | Matrix B | Matrix C |
---|---|---|---|---|
8-bit x 8-bit = 32-bit | 1 | 8x8 | 8x8 | 8x8 |
8-bit x 8-bit = 32-bit | 1 | 8x16 | (16x8)T (sparse) | 8x8 |
8-bit x 8-bit = 32-bit | 1 | 4x16 | (16x8)T (sparse) | 4x8* |
8-bit x 8-bit = 32-bit | 1 | 4x8 | 8x16 | 4x16 |
8-bit x 8-bit = 32-bit | 1 | 4x16 | (16x16)T (sparse) | 4x16 |
8-bit x 8-bit = 32-bit | 64 | 1x2 | 2x1 | 1x1 |
8-bit x 8-bit = 32-bit | 64 | 1x1 | 1x1 | 1x1 |
8-bit x 8-bit = 32-bit | 8 | 8x8 (conv.) | 8x1 | 8x1 |
8-bit x 8-bit = 32-bit | 1 | 64x8 (conv.) | 8x1 | 64x1 |
16-bit x 16-bit = 32-bit | 1 | 8x2 | 2x8 | 8x8 |
16-bit x 16-bit = 32-bit | 64 | 1x1 | 1x1 | 1x1 |
16-bit x 16-bit = 32-bit | 32 | 1x1 | 1x1 | 1x1* |
16-bit x 16-bit = 64-bit | 1 | 4x4 | 4x8 | 4x8 |
16-bit x 16-bit = 64-bit | 1 | 4x8 | (8x8)T (sparse) | 4x8 |
16-bit x 16-bit = 64-bit | 32 | 1x2 | 2x1 | 1x1 |
16-bit x 16-bit = 64-bit | 32 | 1x1 | 1x1 | 1x1 |
16-bit x 16-bit = 64-bit | 1 | 32x4 (conv.) | 4x1 | 32x1 |
16-bit x 16-bit = 64-bit | 8 | 4x4 (conv.) | 4x1 | 4x1 |
Complex 16-bit x Complex 16-bit = 64-bit | 16 | 1x2 | 2x1 | 1x1 |
Complex 16-bit x Complex 16-bit = 64-bit | 16 | 1x1 | 1x1 | 1x1 |
32-bit x 16-bit = 64-bit | 1 | 4x2 | 2x8 | 4x8 |
Complex 32-bit x Complex 16-bit = 64-bit | 16 | 1x1 | 1x1 | 1x1 |
Complex 32-bit x Complex 16-bit = 64-bit | 8 | 1x1 | 1x1 | 1x1* |
fp8 x fp8 = fp32 | 1 | 8x8 | 8x8 | 8x8 |
fp8 x fp8 = fp32 | 1 | 4x16 | (16x16)T (sparse) | 4x16 |
fp16 x fp16 = fp32 | 64 | 1x1 | 1x1 | 1x1 |
fp16 x fp16 = fp32 | 32 | 1x1 | 1x1 | 1x1* |
fp16 x fp16 = fp32 | 1 | 4x8 | 8x8 | 4x8 |
fp16 x fp16 = fp32 | 1 | 4x16 | (16x8)T (sparse) | 4x8 |
bfloat16 x bfloat16 = fp32 | 64 | 1x1 | 1x1 | 1x1 |
bfloat16 x bfloat16 = fp32 | 32 | 1x1 | 1x1 | 1x1* |
bfloat16 x bfloat16 = fp32 | 1 | 4x8 | 8x8 | 4x8 |
bfloat16 x bfloat16 = fp32 | 1 | 4x16 | (16x8)T (sparse) | 4x8 |
MX9 x MX9 = fp32 | 1 | 4x8 4x8 | (8x16)T (8x16)T | 4x16 |
MX6 x MX6 = fp32 | 1 | 4x16 | (16x16)T | 4x16 |
The MUL and MAC operations can be summarized as acc_out = X_vec x Y_vec and acc_out = acc_in + X_vec x Y_vec respectively, with 'x' being the matrix multiplication operator. The MSC, NEGMUL and MACMUL operations, and the MAC/MSC variants with an additional acc_in2 input, follow the same pattern with the product term subtracted or negated, or with a further accumulator input added.
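A scalar sketch of these accumulator semantics (one lane shown; the function names are illustrative, and the MACMUL/acc_in2 variants are omitted):

```cpp
#include <cstdint>

// One accumulator lane; prod stands for the corresponding lane of the
// matrix product X_vec x Y_vec, acc_in for the incoming accumulator lane.
std::int64_t mul_lane(std::int64_t prod)                      { return  prod; }           // MUL
std::int64_t mac_lane(std::int64_t acc_in, std::int64_t prod) { return  acc_in + prod; }  // MAC
std::int64_t msc_lane(std::int64_t acc_in, std::int64_t prod) { return  acc_in - prod; }  // MSC
std::int64_t negmul_lane(std::int64_t prod)                   { return -prod; }           // NEGMUL
```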
The convolve variants of these intrinsics differ in that they apply a convolution product on the vectors instead of a matrix multiplication, with '*' being the vector convolution operator: acc_out = acc_in + X_vec * Y_vec. Here, X_vec is the data matrix and Y_vec the kernel.
Some variants allow the passing of masks that are used to determine the sign, zeroing and negation of vector or accumulator lanes. These masks are the sub, zero and shift masks indicated by the _conf suffix.
Complex multiplications require some terms to be negated in order to implement conjugation and multiplication by -j. This is done through the sub_mask. The following examples show how this mask is used when two complex numbers, X and Y, are multiplied to get an output O. For multiply-accumulate of 16b x 16b complex integer datatypes, two complex products are added after the multiplication; they are indicated by the postfix 0/1.
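As an illustration from standard complex arithmetic, the following expansions show which of the four product terms per complex multiplication change sign (in the 16b x 16b case they apply to each of the two post-added products; the exact mapping of terms to sub_mask bits is not shown here):

\[ X \cdot Y = (x_r y_r - x_i y_i) + j\,(x_r y_i + x_i y_r) \]

\[ \overline{X} \cdot Y = (x_r y_r + x_i y_i) + j\,(x_r y_i - x_i y_r) \]

\[ -j\,(X \cdot Y) = (x_r y_i + x_i y_r) - j\,(x_r y_r - x_i y_i) \]

Conjugation flips the sign of the two terms involving \(x_i\); multiplication by \(-j\) swaps the real and imaginary parts and negates the new imaginary part. The sub_mask encodes exactly such per-term sign choices.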
For multiply-accumulate of 32b x 16b complex integer datatypes there is no post-adding and only four unique terms are needed. However, all 8 mask bits must be specified appropriately: the bits that index the same term must be given the same value.
Some intrinsics are used for multiplications of matrices with a given number of channels. Each MxN matrix is stored in row-major and channel-minor fashion: the channel index varies fastest, then the column, then the row. The following example shows the resulting layout of elements in the vector for a 4x4 matrix with two channels. The indices for each element are given as (m,n,c).
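This layout follows directly from the stated storage rule; vector element \(i\) holds matrix element \((m,n,c)\) with \(i = m \cdot 4 \cdot 2 + n \cdot 2 + c\):

```
(0,0,0) (0,0,1) (0,1,0) (0,1,1) (0,2,0) (0,2,1) (0,3,0) (0,3,1)
(1,0,0) (1,0,1) (1,1,0) (1,1,1) (1,2,0) (1,2,1) (1,3,0) (1,3,1)
(2,0,0) (2,0,1) (2,1,0) (2,1,1) (2,2,0) (2,2,1) (2,3,0) (2,3,1)
(3,0,0) (3,0,1) (3,1,0) (3,1,1) (3,2,0) (3,2,1) (3,3,0) (3,3,1)
```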
The elem variants allow you to perform element-wise operations. The operations are performed along the channels: if you perform a (1x1x32) x (1x1x32) operation, a multiplication is done between the elements of the same channel, so the elements of channel zero are multiplied, the elements of channel one are multiplied, and so on. The end result again has 32 channels.
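A scalar sketch of this per-channel behaviour in plain C++ (illustrative names, not the intrinsic itself):

```cpp
#include <array>
#include <cstdint>

// Element-wise (1x1xC) x (1x1xC): each channel is multiplied independently,
// so the result keeps all C channels (C = 32 in the example above).
template <int C>
std::array<std::int64_t, C> elem_mul(const std::array<std::int16_t, C>& a,
                                     const std::array<std::int16_t, C>& b) {
    std::array<std::int64_t, C> out{};
    for (int c = 0; c < C; ++c)
        out[c] = std::int64_t(a[c]) * b[c];
    return out;
}
```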
Some of the elem variants perform matrix multiplications along the channels. For those cases the multiplication (1x2xC) x (2x1xC) is performed and the end result is a (1x1xC) matrix. Despite the name, this is not a true element-wise multiplication.
Convolutional operations work similarly to element-wise multiplication: in every step the kernel is multiplied with the matrix before it is shifted to the next position, and the same is done for each channel. The difference from a regular element-wise multiplication is that, after the multiplications for each channel have been completed, the resulting matrices are added together, so the final result has only one channel.
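A scalar sketch of this per-channel convolution with the final summation into a single channel (illustrative names; M = 4, N = 4, C = 8 reproduces the 8-channel 4x4 convolution mode in the table):

```cpp
#include <array>
#include <cstdint>

// C-channel convolution: the kernel G is applied per channel, and the
// per-channel results are summed, so the output has a single channel.
// Data is indexed as [channel][position].
template <int M, int N, int C>
std::array<std::int64_t, M> conv_channels(
        const std::array<std::array<std::int16_t, M + N - 1>, C>& F,
        const std::array<std::array<std::int16_t, N>, C>& G) {
    std::array<std::int64_t, M> out{};
    for (int c = 0; c < C; ++c)          // convolve each channel ...
        for (int x = 0; x < M; ++x)
            for (int u = 0; u < N; ++u)  // ... and sum into one channel
                out[x] += std::int64_t(G[c][u]) * F[c][x + u];
    return out;
}
```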
When multiplying with a scalar bfloat16, it is internally cast to float, which influences the rounding behaviour in combination with negation. Because the cast involves a rounding operation, it matters whether the negation is performed before or after the rounding: in the first case, the rounding is applied to the positive result before the negation; in the second and third cases, the negation happens before the rounding, which leads to a different result.
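As a purely illustrative sketch, assume a rounding step \(\operatorname{rnd}\) that is not symmetric around zero (e.g. round-half-up; the actual mode depends on the configured rounding behaviour). Then negating before or after rounding gives different results for values exactly halfway between two representable numbers:

\[ -\operatorname{rnd}(2.5) = -3 \qquad \text{but} \qquad \operatorname{rnd}(-2.5) = -2 \]

In general, \(-\operatorname{rnd}(x) \neq \operatorname{rnd}(-x)\) for such halfway values under an asymmetric mode.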
Element-wise multiplication and matrix multiplication intrinsics for the FP32 input type are emulated using the bfloat16 data path. There are three options to choose from (a sketch of the underlying technique follows the list):

- Default option (most accurate, but slow)
- Fast and accurate option
- Fastest option, with loss of accuracy
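The common idea behind all three options is to split each fp32 operand into bfloat16 parts and accumulate partial products on the bfloat16 data path; more split terms give higher accuracy at the cost of speed, while a single truncated product per operand is fastest but least accurate. The sketch below is illustrative (our own names, not the library's implementation) and shows a two-way split with three partial products:

```cpp
#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 precision (keep the top 16 bits).
static float to_bf16(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFF0000u;
    float r;
    std::memcpy(&r, &u, sizeof r);
    return r;
}

// Emulated fp32 multiply on a bfloat16 data path: split each operand into a
// high and a low bfloat16 part and accumulate the partial products.
// The al*bl term is dropped here as it falls below fp32 precision.
float mul_fp32_via_bf16(float a, float b) {
    float ah = to_bf16(a), al = to_bf16(a - ah);
    float bh = to_bf16(b), bl = to_bf16(b - bh);
    return ah * bh + ah * bl + al * bh;
}
```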