AI Engine  (AIE) r2p15.2
Reduced Lane Addressing Scheme

Intrinsics that operate on vectors but do not perform a multiplication follow a reduced or modified lane-selection scheme compared to the macs/muls. Such operations are adds, subs, abs, vector compares, and vector selections/shuffles. Since these instructions share the initial part of the integer mac/mul datapath, they operate mostly on fixed-point numbers. The only exceptions are the float selects and shuffles, because no arithmetic is performed there. Floating-point arithmetic is always done in the floating-point datapath. The next table summarizes the lane-selection scheme.

Legend

RA: Regular selection scheme.

RA_16: Selection scheme used for 16-bit numbers.

na: Not implemented at intrinsics level.

fpdp: Floating point datapath.

select/shuffle
  i32:    RA
  i16:    RA_16. Offsets are relative to 32 bits. Start is relative to
          16 bits but must be a multiple of 2. Square is relative to 16 bits.
  ci32:   RA. Start and offsets are relative to a full ci32 (64 bits).
  ci16:   RA. Real and imaginary parts are never split.
  float:  RA
  cfloat: RA. Start and offsets are relative to a full cfp32 (64 bits).

add/sub
  i32:    RA
  i16:    RA_16. Offsets are relative to 32 bits. Start is relative to
          16 bits but must be a multiple of 2. Square is relative to 16 bits.
  ci32:   RA. Start and offsets are relative to a full ci32 (64 bits).
  ci16:   RA_16 modified. Start is doubled to represent a full ci16
          (32 bits). Offsets follow the RA_16 scheme. The 16-bit permute
          is disabled.
  float:  fpdp
  cfloat: fpdp

abs
  i32:    RA
  i16:    RA_16. Offsets are relative to 32 bits. Start is relative to
          16 bits but must be a multiple of 2. Square is relative to 16 bits.
  ci32:   na
  ci16:   na
  float:  fpdp
  cfloat: fpdp

cmp
  i32:    RA
  i16:    RA_16. Offsets are relative to 32 bits. Start is relative to
          16 bits but must be a multiple of 2. Square is relative to 16 bits.
  ci32:   na
  ci16:   na
  float:  fpdp
  cfloat: fpdp
Note
The bit width of the datatype does not directly imply the selection scheme used. For example, with cint16 the real and imaginary parts are always moved together, and hence the addressing follows the general scheme.

The basic functionality of these intrinsics is to perform vector comparisons between data from two buffers, the X and Y buffers; the other parameters and options allow flexibility in selecting data within the vectors. When a single input buffer is used, both the X and Y inputs are obtained (with their respective start/offsets/square parameters) from that buffer.

Regular Lane Selection Scheme (RA)

Doing "+1" always means advancing by one lane in the input buffer. The bit width of the datatype is irrelevant.

for i in 0,rows:
    id[i]  = start + offset[i]   // index into the input samples
    out[i] = f( in[id[i]] )      // f can be add, abs, sel, ...
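As an illustration, the RA scheme can be modelled in a few lines of Python. This is a hedged sketch: the function name and the list-based parameters are ours, not part of the intrinsics API.

```python
# Hedged model of the regular lane-selection scheme (RA).
# "+1" in an offset always advances one full lane, whatever the element width.
def ra_select(inp, start, offsets, f=lambda x: x):
    # One output lane per offset entry; f stands in for add, abs, sel, ...
    return [f(inp[start + off]) for off in offsets]

# Example: start=2 with consecutive offsets picks four consecutive lanes.
print(ra_select(list(range(8)), 2, [0, 1, 2, 3]))  # [2, 3, 4, 5]
```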

16-bit Lane Selection Scheme (RA_16)

// in and out are always treated as 16-bit vectors; in[i] and in[i+1] are 16 bits apart

// First permutation stage

The concepts are simple:

  - N offsets cover N*2 output lanes -> 2*idx
  - This means that each offset is used to move two adjacent values -> perm_idx + 1
  - The parity of idx selects the perm_idx formula

for (idx = 0 ; idx < acc_lanes/2; idx += 1)

  if even idx:
    perm_idx = start + 2*offset[idx]

  else //odd idx
    perm_idx = start + 2*offset[idx] + 2*(offset[idx - 1] + 1)

  data[2*idx  ]  =  input[ perm_idx    ]
  data[2*idx+1]  =  input[ perm_idx + 1] //This is just the adjacent one


// Second permutation stage
for ( idx = 0 ; idx < acc_lanes; idx += 4)

  // Square permutes on a smaller granularity, within each group of 4 lanes
  output[idx  ]  =  data[ idx + square[0] ]
  output[idx+1]  =  data[ idx + square[1] ]
  output[idx+2]  =  data[ idx + square[2] ]
  output[idx+3]  =  data[ idx + square[3] ]

Visually, what happens is the following (example for the first two idx):

- Assume that the even offset selects [c,d] and the odd offset selects [g,h] (as an example)

  in =    | a | b [ c | d ] e | f [ g | h ] i | l | m |
  f(offset_[idx0])--^               ^
  g(offset_[idx1])------------------|

  Here the functions f,g represent the ones described in the previous pseudocode.

- Then, data is shaped like this
  data = | c | d | g | h | ..... The next ones are selected by idx2, idx3 ...

- The square parameter finalizes the permutations, assume square 0x0123

  out[0] = data[ square[0] ] = data[ 3 ] = h
  out[1] = data[ square[1] ] = data[ 2 ] = g
  out[2] = data[ square[2] ] = data[ 1 ] = d
  out[3] = data[ square[3] ] = data[ 0 ] = c
    ..

- And hence

  out = | h | g | d | c | .....
Note
The first permutation stage only accepts 32-bit aligned indices, hence start must be a multiple of 2.
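The two permutation stages can be simulated directly. The Python below is a hedged sketch (the offsets are passed as a plain list rather than packed nibbles, and LSB-first nibble order for square is an assumption); it reproduces the | h | g | d | c | example above.

```python
def ra16_permute(inp, start, offsets, square):
    """Hedged model of the RA_16 scheme.

    inp is a flat list of 16-bit lanes, offsets a list of per-pair offsets,
    square a list of 4 indices (least-significant nibble first).
    """
    acc_lanes = 2 * len(offsets)
    data = [None] * acc_lanes
    # First permutation stage: each offset moves two adjacent 16-bit values.
    for idx in range(acc_lanes // 2):
        if idx % 2 == 0:
            perm_idx = start + 2 * offsets[idx]
        else:
            perm_idx = start + 2 * offsets[idx] + 2 * (offsets[idx - 1] + 1)
        data[2 * idx]     = inp[perm_idx]
        data[2 * idx + 1] = inp[perm_idx + 1]  # the adjacent lane
    # Second permutation stage: square reorders within each group of 4 lanes.
    out = [None] * acc_lanes
    for idx in range(0, acc_lanes, 4):
        for k in range(4):
            out[idx + k] = data[idx + square[k]]
    return out

# Even offset selects [c,d], odd offset selects [g,h]; square 0x0123
# (nibbles LSB first: 3,2,1,0) reverses the group of 4.
inp = list("abcdefghilm")
print(ra16_permute(inp, 0, [1, 1], [3, 2, 1, 0]))  # ['h', 'g', 'd', 'c']
```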

Integer intrinsic naming convention

The general naming convention for the integer vector intrinsics is shown below:

  {ge|gt|le|lt|max|maxdiff|min}{16|32}
Note
The 16 or 32 in the intrinsic name refers to the number of lanes returned in the output. For the intrinsics under Vector Comparison, a "lane" is a bit in the return word, while for all others it is a word of either 16 or 32 bits (according to the input size).

Floating-point intrinsic naming convention

The general naming convention for the floating vector compare intrinsics is shown below:

  fp{ge|gt|le|lt|max|min}

Data offsetting for more than 8 output lanes

When the output has more than 8 lanes (e.g. 16), there are extra offset parameters. Apart from the usual 'offsets' parameter there is an additional 'offsets_hi' parameter for the extra lanes. This extra parameter selects the data that is placed into the upper input lanes (8 to 15) of the multiplier.
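As a sketch of how the two parameters combine, the following Python models a 16-lane selection under the regular scheme (RA). The function names and the least-significant-nibble-first packing are assumptions on our part, not the intrinsics API.

```python
def unpack_nibbles(word, n):
    # Extract n 4-bit offset fields, least-significant nibble first (assumption).
    return [(word >> (4 * i)) & 0xF for i in range(n)]

def ra_select_16(inp, start, offsets, offsets_hi):
    # Output lanes 0-7 are driven by 'offsets', lanes 8-15 by 'offsets_hi'.
    idx = unpack_nibbles(offsets, 8) + unpack_nibbles(offsets_hi, 8)
    return [inp[start + off] for off in idx]

# Consecutive offsets with start=4 simply pick lanes 4 through 19.
print(ra_select_16(list(range(32)), 4, 0x76543210, 0xFEDCBA98))
```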

Examples

We have a use case: transposing a matrix contained in a v64int16. Note this is real 16-bit data, so we use the real-data scheme described above with the select32 intrinsic (Vector Lane Selection).
Our input data is packed as 2x2 tiles in vector registers, and we would like the output in the same format. Input:

00 01 10 11 02 03 12 13 04 05 14 15 06 07 16 17 20 21 30 31 22 23 32 33 24 25 34 35 26 27 36 37
0 8 16 24
40 41 50 51 42 43 52 53 44 45 54 55 46 47 56 57 60 61 70 71 62 63 72 73 64 65 74 75 66 67 76 77
32 40 48 56

In this case we would use the following indexing for the matrix transpose in “2x2 tiles”

xstart=0,  xoffset=0x----0800, xoffset_hi=0x----0a02, xsquare=0x3120
ystart=32, yoffset=0x0800----, yoffset_hi=0x0a02----, ysquare=0x3120
select = b11111111000000001111111100000000

32 outputs in 2x2 tiles:

00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71 02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73
0 8 16 24

These constitute the first 4 rows of the transposed matrix with 2x2 packing:

00 10 20 30 40 50 60 70
01 11 21 31 41 51 61 71
02 12 22 32 42 52 62 72
03 13 23 33 43 53 63 73
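The whole select32 above can be simulated by running the RA_16 scheme on the X and Y parameters and muxing the lanes with the select mask. This is a hedged model: the don't-care nibbles ('-') are treated as 0, and the LSB-first nibble/bit ordering is an assumption.

```python
def nibbles(word, n):
    # n 4-bit fields, least-significant nibble first (assumption).
    return [(word >> (4 * i)) & 0xF for i in range(n)]

def ra16(inp, start, offs, square):
    # RA_16: pairs of adjacent 16-bit lanes, then a per-group-of-4 permute.
    lanes = 2 * len(offs)
    data = [None] * lanes
    for i in range(lanes // 2):
        if i % 2 == 0:
            p = start + 2 * offs[i]
        else:
            p = start + 2 * offs[i] + 2 * (offs[i - 1] + 1)
        data[2 * i], data[2 * i + 1] = inp[p], inp[p + 1]
    return [data[(i // 4) * 4 + square[i % 4]] for i in range(lanes)]

# The v64int16 input, packed as 2x2 tiles (64 lanes).
buff = ("00 01 10 11 02 03 12 13 04 05 14 15 06 07 16 17 "
        "20 21 30 31 22 23 32 33 24 25 34 35 26 27 36 37 "
        "40 41 50 51 42 43 52 53 44 45 54 55 46 47 56 57 "
        "60 61 70 71 62 63 72 73 64 65 74 75 66 67 76 77").split()

xo = nibbles(0x00000800, 8) + nibbles(0x00000A02, 8)  # xoffset, xoffset_hi
yo = nibbles(0x08000000, 8) + nibbles(0x0A020000, 8)  # yoffset, yoffset_hi
sq = nibbles(0x3120, 4)
xout = ra16(buff, 0, xo, sq)    # xstart = 0
yout = ra16(buff, 32, yo, sq)   # ystart = 32

# select bit i = 1 picks the Y datapath for output lane i (assumption: LSB = lane 0).
sel = 0b11111111000000001111111100000000
out = [yout[i] if (sel >> i) & 1 else xout[i] for i in range(32)]
print(" ".join(out))  # the 2x2-tile transpose shown above
```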

If the 2x2 packing does not conform to the input requirements of the subsequent kernel, a "row-major" transpose can be generated with a second select32.

32 outputs of the 1st select32 in the first example:

00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71 02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73
0 8 16 24
xstart=0,  xoffset=0x15111410, xoffset_hi=0x1d191c18, xsquare=0x3210
ystart=-----------------------------------------
select = b00000000000000000000000000000000

32 outputs of the 2nd select32 generating the “row-major” transpose:

00 10 20 30 40 50 60 70 01 11 21 31 41 51 61 71 02 12 22 32 42 52 62 72 03 13 23 33 43 53 63 73
0 8 16 24

These are the first 4 rows of the "row-major" transpose:

00 10 20 30 40 50 60 70
01 11 21 31 41 51 61 71
02 12 22 32 42 52 62 72
03 13 23 33 43 53 63 73
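As a cross-check, the addressing of this 2nd select32 can be simulated with the RA_16 scheme described earlier. The Python below is a hedged model (the nibble packing order is an assumption; since the select mask is all zeros, every lane comes from the X datapath).

```python
def nibbles(word, n):
    # n 4-bit fields, least-significant nibble first (assumption).
    return [(word >> (4 * i)) & 0xF for i in range(n)]

def ra16_permute(inp, start, offsets, square):
    # RA_16: first stage picks pairs of adjacent 16-bit lanes, second
    # stage permutes within each group of 4 lanes via 'square'.
    acc_lanes = 2 * len(offsets)
    data = [None] * acc_lanes
    for idx in range(acc_lanes // 2):
        if idx % 2 == 0:
            perm_idx = start + 2 * offsets[idx]
        else:
            perm_idx = start + 2 * offsets[idx] + 2 * (offsets[idx - 1] + 1)
        data[2 * idx], data[2 * idx + 1] = inp[perm_idx], inp[perm_idx + 1]
    out = [None] * acc_lanes
    for idx in range(0, acc_lanes, 4):
        for k in range(4):
            out[idx + k] = data[idx + square[k]]
    return out

# Output of the 1st select32 (2x2-tile transpose), as 16-bit lanes:
x = ("00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71 "
     "02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73").split()

# xoffset=0x15111410, xoffset_hi=0x1d191c18, xsquare=0x3210, xstart=0
offs = nibbles(0x15111410, 8) + nibbles(0x1D191C18, 8)
out = ra16_permute(x, 0, offs, nibbles(0x3210, 4))
print(" ".join(out))  # the "row-major" transpose shown above
```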