AI Engine (AIE) r2p15.2
Intrinsics that operate on vectors but do not perform a multiplication follow a reduced or modified lane-selection scheme with respect to the macs/muls. Such operations are adds, subs, abs, vector compares, and vector selects/shuffles. Since these instructions share the initial part of the integer mac/mul datapath, they operate mostly on fixed-point numbers. The only exceptions are the float selects and shuffles, because no arithmetic is performed there. Floating-point arithmetic is always done in the floating-point datapath. The next table summarizes the lane-selection scheme.
RA: Regular selection scheme.
RA_16: Selection scheme used for 16-bit numbers.
na: Not implemented at the intrinsics level.
fpdp: Floating-point datapath.
| | i32 | i16 | ci32 | ci16 | float | cfloat |
|---|---|---|---|---|---|---|
| select/shuffle | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | RA. Start and offsets are relative to a full ci32 (64 bits). | RA. Real and imaginary parts are never split. | RA | RA. Start and offsets are relative to a full cfp32 (64 bits). |
| add/sub | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | RA. Start and offsets are relative to a full ci32 (64 bits). | RA_16 modified. Start is doubled to represent a full ci16 (32 bits). Offsets follow the RA_16 scheme. The 16-bit permute is disabled. | fpdp | fpdp |
| abs | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | na | na | fpdp | fpdp |
| cmp | RA | RA_16. Offsets are relative to 32 bits. Start is relative to 16 bits but must be a multiple of 2. Square is relative to 16 bits. | na | na | fpdp | fpdp |
The basic functionality of these intrinsics is to perform vector comparisons between data from two buffers, the X and Y buffers, with the other parameters and options providing flexibility (data selection within the vectors). When a single input buffer is used, both the X and Y inputs are obtained (with their respective start/offsets/square parameters) from that buffer.
Doing "+1" always means advancing by one lane in the input buffer; the bit width of the datatype is irrelevant.
```
for i in 0, rows:
    id[i]  = (start + offset[i]) % input_samples
    out[i] = f( in[id[i]] )   // f can be add, abs, sel ...
```
```
// in and out are always treated as 16-bit vectors: in[i] and in[i+1] are 16 bits apart

// First permutation stage. The concepts are simple:
//  - N offsets cover N*2 output lanes                -> 2*idx
//  - Each offset is used to move two adjacent values -> perm_idx + 1
//  - The parity of idx selects the perm_idx formula
for (idx = 0; idx < acc_lanes/2; idx += 1)
    if even idx:
        perm_idx = start + 2*offset[idx]
    else:  // odd idx
        perm_idx = start + 2*offset[idx] + 2*(offset[idx-1] + 1)
    data[2*idx    ] = input[ perm_idx     ]
    data[2*idx + 1] = input[ perm_idx + 1 ]   // This is just the adjacent one

// Second permutation stage
for (idx = 0; idx < acc_lanes; idx += 4)
    // Square is used to permute on a smaller granularity (within each group of 4)
    output[idx    ] = data[ idx + square[0] ]
    output[idx + 1] = data[ idx + square[1] ]
    output[idx + 2] = data[ idx + square[2] ]
    output[idx + 3] = data[ idx + square[3] ]
```
Visually, what happens is the following (example for the first two idx):
- Assume that the even offset selects [c,d] and the odd offset selects [g,h] (as an example):

```
in = | a | b [ c | d ] e | f [ g | h ] i | l | m |
               ^                 ^
               f(offset[idx0])   g(offset[idx1])
```

  Here the functions f, g represent the even/odd perm_idx formulas described in the previous pseudocode.

- Then data is shaped like this (the next entries are selected by idx2, idx3, ...):

```
data = | c | d | g | h | .....
```

- The square parameter finalizes the permutation. Assume square = 0x0123:

```
out[0] = data[ square[0] ] = data[ 3 ] = h
out[1] = data[ square[1] ] = data[ 2 ] = g
out[2] = data[ square[2] ] = data[ 1 ] = d
out[3] = data[ square[3] ] = data[ 0 ] = c
...
```

- And hence:

```
out = | h | g | d | c | .....
```
The general naming convention for the integer vector intrinsics is shown below:
{ge|gt|le|lt|max|maxdiff|min}{16|32}
The general naming convention for the floating-point vector compare intrinsics is shown below:
fp{ge|gt|le|lt|max|min}
When the output has more than 8 lanes (e.g. 16), there are extra offset parameters: apart from the usual 'offsets' parameter there is an additional 'offsets_hi' parameter for the extra lanes. This extra parameter allows selecting the data that will be placed into the upper input lanes (8-16) of the multiplier.
We have a use case of transposing a matrix contained in a v64int16. Note that this is real 16-bit data, so we will use the real-data scheme described above with the select32 intrinsic (Vector Lane Selection).
Our input data is packed as 2x2 tiles in vector registers and we would also like to output in this same format. Input:
In this case we would use the following indexing for the matrix transpose in “2x2 tiles”:

```
xstart=0,  xoffset=0x----0800, xoffset_hi=0x----0a02, xsquare=0x3120
ystart=32, yoffset=0x0800----, yoffset_hi=0x0a02----, ysquare=0x3120
select = b11111111000000001111111100000000
```
32 outputs in 2x2 tiles:
Constituting the first 4 rows of the transposed matrix w/ 2x2 packing:
If the 2x2 packing does not conform to the input requirements of the subsequent kernel, it is possible to generate a “row-major” transpose using a 2nd select32.
32 outputs of the 1st select32 in first example:
```
xstart=0, xoffset=0x15111410, xoffset_hi=0x1d191c18, xsquare=0x3210
ystart=-----------------------------------------
select = b00000000000000000000000000000000
```
32 outputs of the 2nd select32 generating the “row-major” transpose:
Which is the first 4 rows of the “row-major” transpose: