# Runtime Operation Memory Aliasing

## Table of Contents

1. [Overview](#overview)
1. [Runtime Operations](#runtime-operations)
1. [Gather Runtime](#gather-runtime)
1. [Slice Runtime](#slice-runtime)
1. [Concat Runtime](#concat-runtime)
1. [Split Runtime](#split-runtime)
1. [Advanced Patterns](#advanced-patterns)
1. [Memory Layout Examples](#memory-layout-examples)
1. [Comparison with Noop Aliasing](#comparison-with-noop-aliasing)
1. [Usage and Integration](#usage-and-integration)
1. [Assumptions and Limitations](#assumptions-and-limitations)

---

## Overview

Runtime operations are structural transformation operations. Unlike noop operations (which only change metadata like shape) and compute operations (which always create new data), runtime operations **conditionally transform data** based on ONNX model parameters, requiring sophisticated memory aliasing strategies.

**Important Context:**

- **Data Format**: NHWC (batch, height, width, channels)
- **C-dimension Padding**: The last dimension (C) is padded to multiples of 64 for performance
- **Operation Axis**: Runtime operations support any axis, but memory aliasing optimization requires all dimensions before the axis to be unity (1). When this condition is met, operations on axis=k are equivalent to axis=0 in memory layout
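The unit-leading-dimension condition can be checked in a few lines. The following is a minimal sketch; the helper name `can_alias_on_axis` is hypothetical, and it assumes negative axes are normalized by adding the rank, as in ONNX.

```python
def can_alias_on_axis(shape, axis):
    """Return True when every dimension before `axis` is 1.

    Hypothetical helper: `shape` is a sequence of ints, `axis` may be negative.
    """
    rank = len(shape)
    if not -rank <= axis < rank:
        raise ValueError(f"axis={axis} out of range for rank {rank}")
    axis = axis % rank  # normalize negative axes
    return all(dim == 1 for dim in shape[:axis])


# axis=0 has no leading dimensions, so the condition holds trivially.
print(can_alias_on_axis((50, 32, 32, 128), axis=0))  # True
# axis=2 qualifies only when N and H are both 1.
print(can_alias_on_axis((1, 1, 32, 128), axis=2))    # True
print(can_alias_on_axis((50, 32, 32, 128), axis=2))  # False
```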

**Key Characteristics:**

- **Behavior from ONNX attributes**: Operations may read parameters like `starts`, `ends`, `axis`, or `split` from the model
- **In-place potential**: Can reuse buffers when size relationships are favorable
- **Contiguous allocation**: Some operations require strict memory layout constraints
- **Offset-based aliasing**: Outputs can point to specific offsets within input buffers

**Supported Runtime Operations:**

1. **`gather_runtime`**: Max-size aliasing (input/output share buffer sized to larger)
2. **`slice_runtime`**: Offset-based aliasing (output points into input at specific offset)
3. **`concat_runtime`**: Contiguous input aliasing (inputs laid out sequentially, output at start)
4. **`split_runtime`**: Contiguous output aliasing (outputs laid out sequentially from input)

---

## Runtime Operations

### What Makes an Operation "Runtime"?

Runtime operations have characteristics of both noop and compute operations:

| Aspect                     | Noop Ops     | Runtime Ops          | Compute Ops |
|----------------------------|--------------|----------------------|-------------|
| **Data Modification**      | None         | Conditional          | Always      |
| **Memory Reuse**           | Always alias | Conditional alias    | Never alias |
| **Size Relationship**      | Same size    | Variable size        | Independent |
| **Offset Calculation**     | None         | From ONNX attributes | N/A         |
| **Contiguity Requirement** | None         | Sometimes required   | None        |

**Common Pattern**: Runtime operations read ONNX model attributes (like `starts`, `split`, `axis`) to determine:

- Which portions of input to access
- Where to place outputs in memory
- Whether in-place operation is possible

---

## Gather Runtime

### Overview

`gather_runtime` selects elements from an input tensor based on indices specified at runtime. The operation gathers elements along a specified axis according to ONNX Gather semantics.

**ONNX Inputs:**

- `data`: Input tensor to gather from (NHWC format)
- `indices`: Indices to gather (typically small metadata tensor)

**ONNX Attributes:**

- `axis`: Axis along which to gather (any axis supported, but memory aliasing requires all dimensions before axis to be 1)

**Output Shape:** When gathering along axis=k with 1-D `indices`, the output has the same dimensions as the input except at axis=k, where the extent equals the number of indices.

### Memory Strategy

**Allocation Size**: `max(size_data, size_output)`

Since this is a **static ONNX graph**, all tensor shapes (including output shapes) are known at compile/build time through ONNX shape inference.

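**Design Assumption**: Output size ≤ input size. ONNX Gather itself can produce an output larger than its input (for example, when indices repeat), so the allocator checks this assumption explicitly (step 6 under Allocator Requirements) instead of taking it for granted.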

**Allocation Formula:**

```python
# product, round_up and sizeof are helper functions; shapes are Python lists
allocation_size = max(
    product(data_shape[:-1] + [round_up(data_shape[-1], 64)]) * sizeof(dtype),
    product(output_shape[:-1] + [round_up(output_shape[-1], 64)]) * sizeof(dtype),
)
```

Where:

- `data_shape` and `output_shape` are known from the ONNX graph (NHWC format: N, H, W, C)
- `round_up(C, 64)` pads the C dimension (last dimension) to next multiple of 64 for performance
- All other dimensions (N, H, W) remain unchanged
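- **Note**: The examples in this document operate on the N (batch) dimension. Other axes are supported for aliasing only when all dimensions before the axis are 1, which makes them equivalent to axis=0 in memory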
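The physical-size calculation used throughout this document can be sketched as follows. The helper names `round_up` and `physical_size` are illustrative assumptions, not the allocator's actual API; `itemsize` stands in for `sizeof(dtype)`.

```python
from math import prod


def round_up(value, multiple=64):
    """Round `value` up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def physical_size(shape, itemsize, c_align=64):
    """Bytes occupied by a tensor whose last (C) dimension is padded."""
    *outer, c = shape
    return prod(outer) * round_up(c, c_align) * itemsize


# C=64 is already a multiple of 64, so no padding is added.
print(physical_size((100, 16, 16, 64), itemsize=4))  # 6553600
# C=100 is padded to 128.
print(physical_size((1, 16, 16, 100), itemsize=4))   # 131072
```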

**Allocator Requirements:**

1. **Read input tensor shape** from ONNX graph metadata (NHWC format)
2. **Read output tensor shape** from ONNX graph metadata (already computed by ONNX shape inference)
3. **Validate axis attribute**: Verify all dimensions before the axis are unity (1)
   - If validation fails, throw error: "gather_runtime: axis={axis} requires all leading dimensions to be 1 for memory aliasing"
4. **Apply C dimension padding** to both shapes: `padded_C = round_up(C, 64)` where C is the last dimension
5. **Calculate physical sizes** using padded shapes:
   - `input_physical_size = product(input_shape[:-1] + [round_up(input_shape[-1], 64)]) × sizeof(dtype)`
   - `output_physical_size = product(output_shape[:-1] + [round_up(output_shape[-1], 64)]) × sizeof(dtype)`
6. **Validate size constraint**: Verify that `output_physical_size ≤ input_physical_size`
   - If violated, throw error: "gather_runtime: output physical size exceeds input physical size after C-dimension padding"
7. **Allocate** `input_physical_size` bytes (since output ≤ input by constraint)
8. **Alias** output to input buffer

**Note**: When all leading dimensions are 1, gather on axis=k is equivalent to gather on axis=0 in memory layout, enabling efficient aliasing.
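The allocator requirements above can be sketched as a single planning function. This is a hedged illustration under the stated design assumption (output ≤ input); `plan_gather` and `physical_size` are hypothetical names, and the function only computes sizes and offsets rather than touching real memory.

```python
from math import prod


def physical_size(shape, itemsize, c_align=64):
    """Bytes for `shape` with the last (C) dimension padded to `c_align`."""
    *outer, c = shape
    padded_c = -(-c // c_align) * c_align
    return prod(outer) * padded_c * itemsize


def plan_gather(data_shape, output_shape, axis, itemsize):
    """Return (allocation_size, output_offset) for an aliased gather."""
    axis = axis % len(data_shape)
    if any(dim != 1 for dim in data_shape[:axis]):
        raise ValueError(
            f"gather_runtime: axis={axis} requires all leading dimensions "
            "to be 1 for memory aliasing"
        )
    in_bytes = physical_size(data_shape, itemsize)
    out_bytes = physical_size(output_shape, itemsize)
    if out_bytes > in_bytes:
        raise ValueError(
            "gather_runtime: output physical size exceeds input physical "
            "size after C-dimension padding"
        )
    return in_bytes, 0  # output aliases the start of the input buffer


size, offset = plan_gather((50, 32, 32, 64), (30, 32, 32, 64), axis=0, itemsize=4)
print(size, offset)  # 13107200 0
```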

### Memory Layout

```text
Case 1: Output larger than input (general max-size scheme; rejected by the step 6 size check)
┌───────────────────────────────────────┐
│  Input uses [0, size_input)           │
│  Output uses [0, size_output)         │
│  Allocated: max(input, output)        │
└───────────────────────────────────────┘
|                                       |
└─── Same physical buffer ──────────────┘

Case 2: Input larger than output
┌───────────────────────────────────────┐
│  Input uses [0, size_input)           │
│  Output uses [0, size_output)         │
│  Extra: [size_output, size_input)     │
└───────────────────────────────────────┘
|                                       |
└─── Same physical buffer ──────────────┘
```

### Example

**Example with padding (axis=0 operating on the N dimension; there are no dimensions before axis=0, so the aliasing condition holds trivially):**

```python
# ONNX Model: gather_runtime with axis=0 (N dimension in NHWC)
# Data logical shape:    (50, 32, 32, 160)  # NHWC: N=50, H=32, W=32, C=160
# Data padded shape:     (50, 32, 32, 192)  # C: 160 → 192 (round up to multiple of 64)
# Data physical size:    50 × 32 × 32 × 192 × 4 = 39,321,600 bytes (float32)

# Indices input: shape=(30,) - gathering 30 batches from N dimension
# Output logical shape:  (30, 32, 32, 160)  # NHWC: N=30, H=32, W=32, C=160
# Output padded shape:   (30, 32, 32, 192)  # C: 160 → 192 (same padding as input C)
# Output physical size:  30 × 32 × 32 × 192 × 4 = 23,592,960 bytes

# Allocation (using physical sizes)
allocation_size = max(39_321_600, 23_592_960)  # 39,321,600 bytes
address_data = 0x1000
address_output = 0x1000  # Same buffer

# Key insight: Gathering on N dimension doesn't affect C padding
# Output size < Input size, so they share the larger buffer
```

**Note**: ONNX Gather supports arbitrary axis values. This implementation supports any axis where all leading dimensions are unity (1), which makes operations equivalent to axis=0 in memory layout.

---

## Slice Runtime

### Overview

`slice_runtime` extracts a contiguous slice from an input tensor along specified axes. The output points to a specific offset within the input buffer, requiring no additional memory allocation.

**ONNX Attributes:**

- `starts`: Starting indices for each axis
- `ends`: Ending indices for each axis
- `axes`: Which axes to slice along (any axis supported, but memory aliasing requires all dimensions before the axis to be 1)
- `steps`: Step sizes (typically 1)

### Memory Strategy

**Allocation Size**: `size_input` (only input is allocated)

**Key Property**: Slice operations always produce output size ≤ input size (extracting a subset), so the output can always safely alias into the input buffer.

**Output Address**: `address_input + offset`

Where the offset is calculated based on:

- `starts`: Starting indices for each axis (from ONNX attribute)
- `axes`: Which axes to slice along (from ONNX attribute)
- **Important**: Offset calculation must use the **padded input shape**, not logical shape
  - Strides are computed with `padded_C = round_up(C, 64)` as the extent of the last dimension
  - All other dimensions keep their original extents

**IMPORTANT - Alignment Consideration:**

- The **input buffer base address** is aligned to the configured alignment boundary (e.g., 4KB)
- The **output offset** is calculated relative to the aligned input address
- The **output start address** (`input_aligned_address + byte_offset`) **may NOT be aligned** to the alignment boundary
- This is expected behavior: the offset depends on the slice `starts` indices (from ONNX attributes)
- The runtime must handle potentially unaligned output addresses when accessing slice results

```python
# Given input buffer at aligned address (e.g., 0x1000 aligned to 4KB)
# Calculate output offset relative to the aligned input address
# stride_k is measured in elements; k is the normalized axis
byte_offset = start_k * stride_k * sizeof(dtype)
output_address = input_aligned_address + byte_offset  # May not be aligned!
```

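**Note**: The `ends` and `steps` attributes don't affect the offset; they only determine the output shape (already computed by ONNX shape inference). Offset-based aliasing does assume a step of 1 on the sliced axis, since any other step yields a non-contiguous output.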

**Important**: When all leading dimensions are 1, operations on axis=k are equivalent to axis=0 in memory layout, enabling efficient offset calculation.

**Allocator Requirements:**

1. **Read input tensor shape** from ONNX graph metadata (NHWC format)
2. **Apply C dimension padding** to input shape: `padded_C = round_up(C, 64)` where C is the last dimension
3. **Calculate input physical size** using padded shape
4. **Read and validate slice attributes** from ONNX:
   - Read `axes` attribute and validate all dimensions before the axis are unity (1)
   - If validation fails, throw error: "slice_runtime: axis={axis} requires all leading dimensions to be 1 for memory aliasing"
   - Read `starts` attribute and verify `len(starts) == len(axes)`
   - If `len(starts) != len(axes)`, throw error: "slice_runtime: starts and axes must have the same length"
   - Read `starts[0]` for axis offset calculation
5. **Calculate output byte offset** within the padded input buffer
6. **Allocate** `input_physical_size` bytes for input
7. **Alias** output to `input_address + byte_offset` (no additional allocation needed)

**Byte Offset Calculation (when all leading dimensions are 1):**

```python
# Example: input shape NHWC = (50, 32, 32, 160), float32
# Slice: starts=[10], axes=[0] (slicing on N dimension, no leading dimensions before axis=0)

def round_up(value, multiple):
    return -(-value // multiple) * multiple

# Step 1: Build padded shape
logical_shape = [50, 32, 32, 160]  # NHWC
padded_shape = logical_shape[:-1] + [round_up(logical_shape[-1], 64)]  # [50, 32, 32, 192]

# Step 2: Calculate stride for axis=0
# stride_0 = H × W × padded_C (elements per batch)
stride_0 = padded_shape[1] * padded_shape[2] * padded_shape[3]  # 196,608 elements per batch

# Step 3: Compute element offset (for axis=0)
start_0 = 10  # From ONNX attribute starts=[10]
element_offset = start_0 * stride_0  # 10 * 196,608 = 1,966,080

# Step 4: Convert to byte offset
itemsize = 4  # sizeof(float32)
byte_offset = element_offset * itemsize  # 1,966,080 * 4 = 7,864,320 bytes
```
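The same calculation generalizes to any axis k whose leading dimensions are all 1. The sketch below is an illustration, not the allocator's actual code: `slice_byte_offset` is a hypothetical name, and the handling of negative or out-of-range `start` values (count from the end, then clamp) is assumed to follow ONNX Slice semantics for a step of 1.

```python
from math import prod


def slice_byte_offset(shape, axis, start, itemsize, c_align=64):
    """Byte offset of a step-1 slice starting at `start` along `axis`."""
    rank = len(shape)
    axis = axis % rank
    if any(dim != 1 for dim in shape[:axis]):
        raise ValueError(
            f"slice_runtime: axis={axis} requires all leading dimensions "
            "to be 1 for memory aliasing"
        )
    padded = list(shape)
    padded[-1] = -(-shape[-1] // c_align) * c_align  # pad C only
    if start < 0:
        start += shape[axis]                      # negative counts from the end
    start = max(0, min(start, shape[axis]))       # clamp into [0, dim]
    stride = prod(padded[axis + 1:])              # elements per step on `axis`
    return start * stride * itemsize


offset = slice_byte_offset((1000, 16, 16, 64), axis=0, start=100, itemsize=4)
print(offset, offset % 4096 == 0)  # 6553600 True

# With leading dimensions of 1, axis=2 behaves like axis=0 in memory.
offset = slice_byte_offset((1, 1, 32, 100), axis=2, start=3, itemsize=4)
print(offset, offset % 4096 == 0)  # 1536 False
```

The second call shows the alignment point made above: the resulting address is valid but not aligned to a 4KB boundary.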

### Memory Layout

```text
Input Buffer (e.g., 40 elements, 160 bytes):
┌────────────────────────────────────────────────────────┐
│  [0:10)  │  [10:30)           │  [30:40)               │
│   Skip   │   Slice Output     │   Skip                 │
└────────────────────────────────────────────────────────┘
           ↑
           Output points here (offset = 10 × 4 = 40 bytes)

Address Calculation:
address_output = address_input + out_start × sizeof(dtype)
                = 0x1000 + 10 × 4
                = 0x1028
```

### Example

```python
# ONNX Model: slice_runtime with starts=[10], axes=[0] (N dimension, no leading dims)
# Input logical shape:  (50, 32, 32, 160)  # NHWC: N=50, H=32, W=32, C=160 (float32)
# Input padded shape:   (50, 32, 32, 192)  # C: 160 → 192 (round up to multiple of 64)
# Input physical size:  50 × 32 × 32 × 192 × 4 = 39,321,600 bytes

# Output logical shape: (20, 32, 32, 160)  # Slicing N over [10:30)
# Output logical size:  20 × 32 × 32 × 160 × 4 = 13,107,200 bytes
# Output physical size: 20 × 32 × 32 × 192 × 4 = 15,728,640 bytes (region inside the input)

# Allocation: only the input gets a buffer
input_physical_size = 39_321_600
address_input = 0x1000

# Calculate byte offset (axis=0, so there are no leading dimensions)
# Stride for axis=0: H × W × padded_C = 32 × 32 × 192 = 196,608 elements/batch
# Element offset = 10 × 196,608 = 1,966,080
# Byte offset = 1,966,080 × 4 = 7,864,320 = 0x780000
byte_offset = 10 * 196_608 * 4

address_output = address_input + byte_offset  # 0x1000 + 0x780000 = 0x781000

# No additional allocation for output!
# Key insight: Offset skips first 10 batches (each batch is H×W×padded_C elements)
```

---

## Concat Runtime

### Overview

`concat_runtime` concatenates multiple tensors along a specified axis. All inputs must be allocated **contiguously** in memory, and the output aliases the start of the first input.

**ONNX Attributes:**

- `axis`: Axis along which to concatenate (any axis supported, but memory aliasing requires all dimensions before the axis to be 1)

**Key Property**: When all leading dimensions are 1, operations on axis=k are equivalent to axis=0 in memory layout. All inputs have the same dimensions except at the concatenation axis.

### Memory Strategy

**Allocation Size**: `∑(physical_size_input_i)` for i in `[0, N-1]`

**Important**: Allocation uses **physical sizes** (with C-dimension padding), not logical sizes.

**Layout Requirement (when all leading dimensions are 1):**

Since all inputs have the same dimensions except at the concatenation axis, they are laid out sequentially in memory:

```text
address_input_1 = address_input_0 + physical_size_input_0
address_input_2 = address_input_1 + physical_size_input_1
...
address_input_(N-1) = address_input_(N-2) + physical_size_input_(N-2)
```

Where `physical_size_input_i` accounts for C-dimension padding.

**Output Address**: `address_input_0` (points to start of first input)

**Allocator Requirements:**

1. **Read all input tensor shapes** from ONNX graph metadata (NHWC format)
2. **Validate axis attribute**: Verify all dimensions before the axis are unity (1)
   - If validation fails, throw error: "concat_runtime: axis={axis} requires all leading dimensions to be 1 for memory aliasing"
3. **Verify all inputs have same dimensions except at concatenation axis**
4. **Apply C dimension padding**: `padded_C = round_up(C, 64)` (same for all inputs)
5. **Calculate physical size for each input** accounting for padding
6. **Compute total allocation size**: `total = ∑(physical_size_input_i)`
7. **Allocate contiguous region** of `total` bytes
8. **Assign addresses sequentially**: `address_input_i = base_address + ∑(physical_size_input_j for j < i)`
9. **Alias output** to base address (same as `address_input_0`)
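The steps above can be sketched as a planning function that returns addresses without allocating anything. The names `plan_concat` and `physical_size` are hypothetical. The sketch also assumes the concat axis is not a padded C axis, since padding between inputs would break contiguity of the logical output in that case.

```python
from math import prod


def physical_size(shape, itemsize, c_align=64):
    """Bytes for `shape` with the last (C) dimension padded to `c_align`."""
    *outer, c = shape
    return prod(outer) * (-(-c // c_align) * c_align) * itemsize


def plan_concat(input_shapes, axis, itemsize, base_address):
    """Return (input_addresses, output_address, total_bytes)."""
    first = input_shapes[0]
    rank = len(first)
    axis = axis % rank
    for shape in input_shapes:
        if any(dim != 1 for dim in shape[:axis]):
            raise ValueError(
                f"concat_runtime: axis={axis} requires all leading "
                "dimensions to be 1 for memory aliasing"
            )
        if len(shape) != rank or any(
            shape[d] != first[d] for d in range(rank) if d != axis
        ):
            raise ValueError(
                "concat_runtime: inputs must match outside the concat axis"
            )
    addresses, cursor = [], base_address
    for shape in input_shapes:
        addresses.append(cursor)  # inputs are packed back to back
        cursor += physical_size(shape, itemsize)
    return addresses, base_address, cursor - base_address


shapes = [(1, 8, 8, 64), (2, 8, 8, 64), (3, 8, 8, 64)]
addresses, out_addr, total = plan_concat(shapes, axis=0, itemsize=4, base_address=0x1000)
print([hex(a) for a in addresses], hex(out_addr), total)
# ['0x1000', '0x5000', '0xd000'] 0x1000 98304
```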

### Memory Layout

```text
Concat with 3 inputs (shapes: [10], [20], [15]):
┌──────────┬────────────────────┬───────────────┐
│ Input 0  │     Input 1        │   Input 2     │
│ 40 bytes │    80 bytes        │  60 bytes     │
└──────────┴────────────────────┴───────────────┘
↑          ↑                    ↑
0x1000     0x1028               0x1078
address_input_0 (= address_output), address_input_1, address_input_2

Total allocation: 180 bytes (40 + 80 + 60)
Output: points to 0x1000, size = 180 bytes
```

### Example

```python
# ONNX Model: concat_runtime with 3 inputs, axis=0 (no leading dims before axis=0)
# All inputs have same H, W, C: (*, 32, 32, 160) float32

# Input 0 logical shape: (10, 32, 32, 160)  # NHWC: N=10
# Input 0 padded shape:  (10, 32, 32, 192)  # C: 160 → 192
# Input 0 physical size: 10 × 32 × 32 × 192 × 4 = 7,864,320 bytes

# Input 1 logical shape: (20, 32, 32, 160)  # NHWC: N=20
# Input 1 padded shape:  (20, 32, 32, 192)  # C: 160 → 192
# Input 1 physical size: 20 × 32 × 32 × 192 × 4 = 15,728,640 bytes

# Input 2 logical shape: (15, 32, 32, 160)  # NHWC: N=15
# Input 2 padded shape:  (15, 32, 32, 192)  # C: 160 → 192
# Input 2 physical size: 15 × 32 × 32 × 192 × 4 = 11,796,480 bytes

# Output logical shape:  (45, 32, 32, 160)  # N: 10 + 20 + 15 = 45
# Output padded shape:   (45, 32, 32, 192)  # C: 160 → 192 (same as inputs)
# Output physical size:  45 × 32 × 32 × 192 × 4 = 35,389,440 bytes

# Allocation (contiguous, using PHYSICAL sizes)
total_allocation = 7_864_320 + 15_728_640 + 11_796_480  # 35,389,440 bytes
address_input_0 = 0x1000                         # Base address
address_input_1 = address_input_0 + 7_864_320    # 0x1000 + 0x780000 = 0x781000
address_input_2 = address_input_1 + 15_728_640   # 0x781000 + 0xF00000 = 0x1681000
address_output = address_input_0                 # Same as input_0

# Key insight: All inputs have same H, W, padded_C (only N differs for axis=0)
# Each input stores N_i complete batches (H×W×padded_C elements each) sequentially
```

---

## Split Runtime

### Overview

`split_runtime` is the inverse of concat - it splits one input tensor into multiple outputs along a specified axis. All outputs must be allocated **contiguously** starting from the input address.

**ONNX Attributes:**

- `axis`: Axis along which to split (any axis supported, but memory aliasing requires all dimensions before the axis to be 1)
- `split`: Array specifying size along the split axis for each output

**Key Property**: When all leading dimensions are 1, operations on axis=k are equivalent to axis=0 in memory layout. All outputs have the same dimensions except at the split axis.

### Memory Strategy

**Allocation Size**: `physical_size_input` (input size with C-dimension padding)

**Important**: Allocation uses **physical size** (with C-dimension padding), and output addresses are calculated using the **physical sizes of outputs**, not logical sizes.

**Layout Requirement (when all leading dimensions are 1)**:

Since all outputs have the same dimensions except at the split axis, they are views into the input buffer:

```text
address_output_0 = address_input
address_output_1 = address_output_0 + physical_size_output_0
address_output_2 = address_output_1 + physical_size_output_1
...
address_output_(N-1) = address_output_(N-2) + physical_size_output_(N-2)
```

Where `physical_size_output_i` accounts for C-dimension padding and the split size.

**Verification**: `∑(physical_size_output_i) = physical_size_input`

**Allocator Requirements:**

1. **Read input tensor shape** from ONNX graph metadata (NHWC format)
2. **Validate axis attribute**: Verify all dimensions before the axis are unity (1)
   - If validation fails, throw error: "split_runtime: axis={axis} requires all leading dimensions to be 1 for memory aliasing"
3. **Apply C dimension padding** to input shape: `padded_C = round_up(C, 64)`
4. **Calculate input physical size** using padded shape
5. **Read split attribute** from ONNX: Array `split = [split[0], split[1], ..., split[k-1]]` specifying size along split axis for each output
6. **Verify split constraint**: `∑(split[i]) == input_shape[axis]` (split values must sum to input size at axis)
7. **Verify all outputs have same dimensions except at split axis**
8. **Calculate physical size for each output** accounting for padding
9. **Verify total size constraint**: `∑(physical_size_output_i) == physical_size_input`
10. **Allocate** `physical_size_input` bytes for input
11. **Assign output addresses sequentially**: `address_output_i = input_address + ∑(physical_size_output_j for j < i)`
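A sketch of these steps, mirroring the concat case in reverse. `plan_split` and `physical_size` are hypothetical names; the function only derives output addresses from the input address and the `split` values.

```python
from math import prod


def physical_size(shape, itemsize, c_align=64):
    """Bytes for `shape` with the last (C) dimension padded to `c_align`."""
    *outer, c = shape
    return prod(outer) * (-(-c // c_align) * c_align) * itemsize


def plan_split(input_shape, split, axis, itemsize, input_address):
    """Return the address of each output as a view into the input buffer."""
    axis = axis % len(input_shape)
    if any(dim != 1 for dim in input_shape[:axis]):
        raise ValueError(
            f"split_runtime: axis={axis} requires all leading dimensions "
            "to be 1 for memory aliasing"
        )
    if sum(split) != input_shape[axis]:
        raise ValueError(
            "split_runtime: split values must sum to the input size at the axis"
        )
    addresses, cursor = [], input_address
    for part in split:
        out_shape = list(input_shape)
        out_shape[axis] = part            # outputs differ only at the split axis
        addresses.append(cursor)
        cursor += physical_size(out_shape, itemsize)
    if cursor - input_address != physical_size(input_shape, itemsize):
        raise ValueError(
            "split_runtime: output physical sizes do not sum to the input size"
        )
    return addresses


addresses = plan_split((6, 8, 8, 64), [1, 2, 3], axis=0, itemsize=4, input_address=0x1000)
print([hex(a) for a in addresses])  # ['0x1000', '0x5000', '0xd000']
```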

### Memory Layout

```text
Split with 3 outputs from 1 input (sizes: 40, 80, 60 bytes):
Input Buffer (180 bytes):
┌──────────┬────────────────────┬───────────────┐
│ Output 0 │     Output 1       │   Output 2    │
│ 40 bytes │    80 bytes        │  60 bytes     │
└──────────┴────────────────────┴───────────────┘
↑          ↑                    ↑
0x1000     0x1028               0x1078

Input:    [0x1000, 0x10B4) = 180 bytes
Output 0: [0x1000, 0x1028) = 40 bytes
Output 1: [0x1028, 0x1078) = 80 bytes
Output 2: [0x1078, 0x10B4) = 60 bytes
```

### Example

```python
# ONNX Model: split_runtime with split=[10, 20, 15], axis=0 (no leading dims)
# Input and outputs have same H, W, C: (*, 32, 32, 160) float32

# Input logical shape:  (45, 32, 32, 160)  # NHWC: N=45, H=32, W=32, C=160
# Input padded shape:   (45, 32, 32, 192)  # C: 160 → 192
# Input physical size:  45 × 32 × 32 × 192 × 4 = 35,389,440 bytes

# Output 0 logical shape: (10, 32, 32, 160)  # NHWC: N=10
# Output 0 padded shape:  (10, 32, 32, 192)  # C: 160 → 192 (same as input)
# Output 0 physical size: 10 × 32 × 32 × 192 × 4 = 7,864,320 bytes

# Output 1 logical shape: (20, 32, 32, 160)  # NHWC: N=20
# Output 1 padded shape:  (20, 32, 32, 192)  # C: 160 → 192
# Output 1 physical size: 20 × 32 × 32 × 192 × 4 = 15,728,640 bytes

# Output 2 logical shape: (15, 32, 32, 160)  # NHWC: N=15
# Output 2 padded shape:  (15, 32, 32, 192)  # C: 160 → 192
# Output 2 physical size: 15 × 32 × 32 × 192 × 4 = 11,796,480 bytes

# Verification: 7_864_320 + 15_728_640 + 11_796_480 = 35,389,440 bytes ✓

# Allocation
address_input = 0x1000                            # Base address
address_output_0 = address_input                  # Same as input
address_output_1 = address_output_0 + 7_864_320   # 0x1000 + 0x780000 = 0x781000
address_output_2 = address_output_1 + 15_728_640  # 0x781000 + 0xF00000 = 0x1681000

# Key insight: All outputs have same H, W, padded_C (only N differs for axis=0)
# Each output is a view of split[i] consecutive batches from input
```

### Concat-Split Duality and Allocation Strategy

**Key Observation**: Concat is the reverse operation of split, which has important implications for memory allocation:

- **Split is straightforward**: Given one input buffer, we simply partition it into contiguous output views
  - Allocate input buffer
  - Calculate output offsets sequentially
  - Each output points into the input at its offset
  - No complex dependencies

- **Concat is the inverse**: Given multiple input tensors, we need them laid out contiguously to form the output
  - Requires all inputs to be allocated contiguously
  - Output aliases the start of the contiguous region
  - More complex due to input dependencies

**Allocation Strategy**:

Since concat requires pre-arranged contiguous inputs (which is harder to achieve in forward traversal), we leverage the duality:

1. **Split allocation** (forward topological order):
   - Straightforward: allocate input, compute output offsets
   - Natural fit for forward graph traversal

2. **Concat allocation** (reverse topological order):
   - Treat concat like a "reverse split"
   - When traversing in reverse: the concat output is "known", inputs need to be partitioned from it
   - This makes concat allocation as simple as split allocation

**Implementation Approach**:

```python
def allocate_runtime_ops(topo_order, allocate_split, allocate_concat_as_reverse_split):
    # topo_order: ops in forward topological order; the two callbacks are the handlers

    # Forward pass: allocate split operations easily
    for op in topo_order:
        if op.type == "split_runtime":
            allocate_split(op)  # Simple: input → output offsets

    # Reverse pass: allocate concat operations like reverse splits
    for op in reversed(topo_order):
        if op.type == "concat_runtime":
            allocate_concat_as_reverse_split(op)  # Output → input offsets

This dual-traversal strategy simplifies concat allocation by reusing the split allocation logic in reverse.

---

## Advanced Patterns

### Chained Concat Operations

One of the most interesting cases occurs when concat operations are chained together:

```text
Graph Structure:

concat_0 inputs: [tensor_1, tensor_2, tensor_3]
concat_0 output: concat_0_out (intermediate)
concat_1 inputs: [tensor_1, concat_0_out, tensor_3]
concat_1 output: final output
```

**Challenge**: The output of `concat_0` must alias its own contiguous inputs (`tensor_1`, `tensor_2`, `tensor_3`) and, at the same time, sit contiguously between `tensor_1` and `tensor_3` as the middle input of `concat_1`.

**Key Observation**: Both `tensor_1` and `tensor_3` appear in **both** concat operations.

**Memory Layout**:

```text
Naive Layout:
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ tensor_1 │ tensor_2 │ tensor_3 │ tensor_1 │concat_0  │ tensor_3 │
│          │          │          │ (copy)   │  _out    │ (copy)   │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Problem: tensor_1 and tensor_3 are duplicated! Wastes memory.
```

To reduce the waste, the allocator needs to:

1. **Detect overlapping inputs**: Recognize that `tensor_1` and `tensor_3` appear in both operations
2. **Allocate shared tensors once**: Place shared tensors at positions satisfying both constraints
3. **Calculate offsets carefully**: Ensure contiguity for both `concat_0` and `concat_1`

```text
Optimized Layout:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ tensor_1 │ tensor_1 │ tensor_2 │ tensor_3 │ tensor_3 │
└──────────┴──────────┴──────────┴──────────┴──────────┘
 ↑          ↑                               ↑          ↑
 |          |─── concat_0_out (middle 3)  ──|          |
 |                                                     |
 |──────────────── concat_1 inputs  ───────────────────|

Explanation:
- concat_0_out points to the middle three blocks: [tensor_1, tensor_2, tensor_3]
- concat_1 inputs:
  * tensor_1: points to first block (offset 0)
  * concat_0_out: points to second block (offset = size(tensor_1)), spans 3 tensors
  * tensor_3: points to fifth block (offset = 2×size(tensor_1) + size(tensor_2) + size(tensor_3))

Total memory: 2×size(tensor_1) + size(tensor_2) + 2×size(tensor_3)
```

**Detailed breakdown for concat_1:**

```text
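Inputs to concat_1:
  - Input 0: tensor_1 (points to offset 0)
  - Input 1: concat_0_out (points to offset size(tensor_1), spans [t1, t2, t3])
  - Input 2: tensor_3 (points to offset 2×size(tensor_1) + size(tensor_2) + size(tensor_3))

Final concat_1 output layout:

┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ tensor_1 │ tensor_1 │ tensor_2 │ tensor_3 │ tensor_3 │
└──────────┴──────────┴──────────┴──────────┴──────────┘
     │          │          │          │          │
     └─ Input 0 └──────────┴─ Input 1 ┘          └─ Input 2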

Explanation:
- Input 0 (tensor_1) contributes once at the beginning
- Input 1 (concat_0_out = [t1, t2, t3]) contributes all three tensors
- Input 2 (tensor_3) contributes once at the end
```
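The block offsets of the five-block layout follow directly from the three tensor sizes. The sketch below is only an illustration of that arithmetic; `chained_concat_layout` is a hypothetical helper and the byte sizes in the usage lines are made up.

```python
def chained_concat_layout(size_t1, size_t2, size_t3):
    """Offsets for the five-block layout [t1, t1, t2, t3, t3]."""
    block_sizes = [size_t1, size_t1, size_t2, size_t3, size_t3]
    offsets, cursor = [], 0
    for size in block_sizes:
        offsets.append(cursor)
        cursor += size
    return {
        "concat_1_input_0": offsets[0],  # tensor_1, first block
        "concat_0_out": offsets[1],      # spans blocks 2 to 4
        "concat_1_input_2": offsets[4],  # tensor_3, fifth block
        "total": cursor,
    }


layout = chained_concat_layout(size_t1=40, size_t2=80, size_t3=60)
print(layout)
# {'concat_1_input_0': 0, 'concat_0_out': 40, 'concat_1_input_2': 220, 'total': 280}
```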

### Multi-Level Aliasing

Runtime operations can be combined with noop operations:

```text
input → slice_runtime → reshape_noop → output

Memory:
┌────────────────────────────────┐
│       Input Buffer             │
│  [offset:offset+size]          │  <- slice points here
│   [same location]              │  <- reshape aliases slice
└────────────────────────────────┘
```

**Benefits**:

- Single allocation for entire chain
- Offset calculation done once
- Zero additional memory for reshape
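One way to model such a chain is to record each aliased tensor as a (parent, byte offset) pair and resolve addresses by walking the links. This is a minimal sketch; the dictionary layout and the name `resolve_address` are assumptions for illustration only.

```python
def resolve_address(name, base_addresses, aliases):
    """Follow alias links until a tensor with its own allocation is reached.

    `aliases` maps a tensor name to (parent_name, byte_offset);
    `base_addresses` maps allocated tensor names to their addresses.
    """
    offset = 0
    seen = set()
    while name in aliases:
        if name in seen:
            raise ValueError(f"alias cycle detected at {name}")
        seen.add(name)
        name, delta = aliases[name]
        offset += delta
    return base_addresses[name] + offset


base_addresses = {"input": 0x1000}
aliases = {
    "slice_out": ("input", 0x640000),  # slice_runtime: offset into the input
    "reshape_out": ("slice_out", 0),   # reshape_noop: same location as the slice
}
print(hex(resolve_address("reshape_out", base_addresses, aliases)))  # 0x641000
```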

---

## Memory Layout Examples

### Example 1: Gather with Larger Output

```text
ONNX: gather_runtime(input: NHWC shape (100, 16, 16, 64), indices: [150 batches])
Input logical:  (100, 16, 16, 64)  float32
Input padded:   (100, 16, 16, 64)  # C=64 already multiple of 64
Input physical: 100 × 16 × 16 × 64 × 4 = 6_553_600 bytes

Output logical: (150, 16, 16, 64)  float32
Output padded:  (150, 16, 16, 64)
Output physical: 150 × 16 × 16 × 64 × 4 = 9_830_400 bytes

Memory:
┌────────────────────────────────────────┐
│  Buffer (9_830_400 bytes)              │
│  = max(6_553_600, 9_830_400)           │
├────────────────────────────────────────┤
│  Step 1: Input at [0, 6_553_600)       │
│  Step 2: Output at [0, 9_830_400)      │
│          (overwrites and extends)      │
└────────────────────────────────────────┘

Allocation: 9_830_400 bytes total
Saving: 6_553_600 bytes (no separate output buffer)
```

### Example 2: Slice with Offset

```text
ONNX: slice_runtime(starts=[100], ends=[200], axes=[0])
Input logical:  (1000, 16, 16, 64)  NHWC float32
Input padded:   (1000, 16, 16, 64)  # C=64 already multiple of 64
Input physical: 1000 × 16 × 16 × 64 × 4 = 65_536_000 bytes

Output: Slicing batches [100:200], yields (100, 16, 16, 64)
Output size: 100 × 16 × 16 × 64 × 4 = 6_553_600 bytes

Memory:
┌─────────────────────────────────────────────────────────┐
│ [0:100)        │ [100:200)       │ [200:1000)           │
│ Skip 100 batch │ Output 100 batch│ Skip 800 batches     │
│ 6_553_600B     │ 6_553_600B      │ 52_428_800B          │
└─────────────────────────────────────────────────────────┘
                 ↑
                 Output points here

Address Calculation:
byte_offset = 100 × (16 × 16 × 64) × 4 = 100 × 16_384 × 4 = 6_553_600
address_output = address_input + 6_553_600
                = 0x1000 + 0x640000 = 0x641000

Allocation: 65,536,000 bytes total
Saving: 6,553,600 bytes (output points into input, no copy)
```

### Example 3: Concat with 4 Inputs

```text
ONNX: concat_runtime(inputs: 4 tensors, axis=0, all with H=16, W=16, C=64)
Input 0: (10, 16, 16, 64)   NHWC float32 → Physical: 10 × 16 × 16 × 64 × 4 = 655_360 bytes
Input 1: (20, 16, 16, 64)   NHWC float32 → Physical: 20 × 16 × 16 × 64 × 4 = 1_310_720 bytes
Input 2: (15, 16, 16, 64)   NHWC float32 → Physical: 15 × 16 × 16 × 64 × 4 = 983_040 bytes
Input 3: (25, 16, 16, 64)   NHWC float32 → Physical: 25 × 16 × 16 × 64 × 4 = 1_638_400 bytes
Output:  (70, 16, 16, 64)   NHWC float32 → Physical: 70 × 16 × 16 × 64 × 4 = 4_587_520 bytes
         (N = 10 + 20 + 15 + 25 = 70)

Memory:
┌─────────┬────────────┬──────────┬────────────┐
│   I0    │     I1     │    I2    │     I3     │
│ 655_360B│ 1_310_720B │ 983_040B │ 1_638_400B │
│ 10 batch│  20 batch  │ 15 batch │  25 batch  │
└─────────┴────────────┴──────────┴────────────┘
↑                                              ↑
0x1000                                    0x461000
Output view: [0x1000, 0x461000) = 4,587,520 bytes

Address calculations:
address_input_0 = 0x1000
address_input_1 = 0x1000 + 655_360 = 0xA1000
address_input_2 = 0xA1000 + 1_310_720 = 0x1E1000
address_input_3 = 0x1E1000 + 983_040 = 0x2D1000
address_output  = 0x1000  # Same as input_0

Allocation: 4,587,520 bytes total
Saving: 4,587,520 bytes (output aliases inputs, no separate buffer)
```

### Example 4: Split into 3 Outputs

```text
ONNX: split_runtime(split=[40, 100, 60], axis=0, with H=16, W=16, C=64)
Input: (200, 16, 16, 64)  NHWC float32
       Physical: 200 × 16 × 16 × 64 × 4 = 13,107,200 bytes

Output 0: (40, 16, 16, 64)  → Physical: 40 × 16 × 16 × 64 × 4 = 2_621_440 bytes
Output 1: (100, 16, 16, 64) → Physical: 100 × 16 × 16 × 64 × 4 = 6_553_600 bytes
Output 2: (60, 16, 16, 64)  → Physical: 60 × 16 × 16 × 64 × 4 = 3_932_160 bytes

Verification: 2_621_440 + 6_553_600 + 3_932_160 = 13_107_200 bytes ✓

Memory:
┌─────────────┬─────────────────────────┬───────────────┐
│  Output 0   │       Output 1          │   Output 2    │
│  40 batch   │      100 batch          │   60 batch    │
│ 2_621_440B  │      6_553_600B         │  3_932_160B   │
└─────────────┴─────────────────────────┴───────────────┘
↑             ↑                         ↑
0x1000        0x281000                  0x8C1000

Address calculations:
address_input    = 0x1000
address_output_0 = 0x1000
address_output_1 = 0x1000 + 2_621_440 = 0x281000
address_output_2 = 0x281000 + 6_553_600 = 0x8C1000

Allocation: 13,107,200 bytes total
Saving: 13,107,200 bytes (outputs alias input, no separate buffers)
```

---

## Comparison with Noop Aliasing

| Aspect                  | Noop Aliasing         | Runtime Aliasing          |
|-------------------------|-----------------------|---------------------------|
| **Size Relationship**   | Always same size      | Variable sizes            |
| **Address Calculation** | Direct alias          | May use offsets           |
| **Contiguity**          | Not required          | Required for concat/split |
| **ONNX Attributes**     | Not used              | Critical for offsets      |
| **Memory Savings**      | 100% (complete alias) | Varies by operation       |
| **Complexity**          | Simple                | Moderate to complex       |
| **Examples**            | Reshape, Squeeze      | Slice, Concat, Split      |

### When to Use Each

**Use Noop Aliasing** when:

- Operation is pure metadata transformation
- Input and output sizes are identical
- No offset calculations needed

**Use Runtime Aliasing** when:

- Operation behavior is determined by ONNX attributes
- Sizes differ but their relationship is known
- A contiguous layout can be exploited

**Combine Both** when:

- A runtime op is followed by a noop op
- Example: `slice_runtime` → `reshape_noop`
- Both outputs alias the original input buffer, so the chain achieves maximum memory savings
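
A minimal sketch of how a combined chain could resolve to a single buffer, assuming a hypothetical `Alias` record that stores a parent tensor and a byte offset (neither the record nor `resolve` is part of the documented allocator):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alias:
    """Hypothetical alias record: a tensor viewed at an offset inside a parent."""
    parent: Optional[str]  # None means the tensor owns its buffer
    offset: int = 0        # byte offset into the parent


def resolve(name: str, aliases: dict[str, Alias]) -> tuple[str, int]:
    """Follow the alias chain to the owning buffer, summing offsets on the way."""
    offset = 0
    while aliases[name].parent is not None:
        offset += aliases[name].offset
        name = aliases[name].parent
    return name, offset


# slice_runtime(starts=[10]) on a (200, 16, 16, 64) float32 input, then reshape_noop
aliases = {
    "x": Alias(parent=None),
    "x_sliced": Alias(parent="x", offset=10 * 16 * 16 * 64 * 4),  # runtime alias
    "x_reshaped": Alias(parent="x_sliced"),                       # noop alias
}
print(resolve("x_reshaped", aliases))  # ('x', 655360)
```

The noop alias contributes a zero offset, so the reshaped tensor ends up at the same address as the slice output inside the buffer owned by `x`.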

---

## Usage and Integration

### Allocator Integration

The multi-bin allocator needs to:

1. **Detect runtime operations** during scheduling
2. **Read ONNX attributes** for offset calculations
3. **Enforce contiguity** for concat/split operations
4. **Calculate addresses** based on operation type
5. **Track aliases** for proper deallocation
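
Step 5 can be approached with reference counting on the owning buffer. The sketch below is one possible design, not the allocator's actual bookkeeping; the `AliasTracker` class and its method names are hypothetical.

```python
class AliasTracker:
    """Hypothetical reference counter keyed by the owning buffer."""

    def __init__(self):
        self.owner = {}     # tensor name -> name of the buffer that backs it
        self.refcount = {}  # buffer name -> number of live tensors using it

    def allocate(self, tensor):
        """Register a tensor that owns a fresh buffer."""
        self.owner[tensor] = tensor
        self.refcount[tensor] = 1

    def alias(self, tensor, source):
        """Register a tensor that shares the buffer backing `source`."""
        root = self.owner[source]
        self.owner[tensor] = root
        self.refcount[root] += 1

    def release(self, tensor):
        """Return the buffer name if it can now be freed, else None."""
        root = self.owner.pop(tensor)
        self.refcount[root] -= 1
        if self.refcount[root] == 0:
            del self.refcount[root]
            return root
        return None


tracker = AliasTracker()
tracker.allocate("x")          # split_runtime input
tracker.alias("out0", "x")     # outputs alias the input buffer
tracker.alias("out1", "x")
print(tracker.release("x"))    # None: outputs still use the buffer
print(tracker.release("out0"))  # None
print(tracker.release("out1"))  # x: last user gone, buffer can be freed
```

The buffer is returned to the allocator only when the last tensor that aliases it is released, regardless of whether that tensor is the original input or one of its aliases.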

### Implementation Checklist

```python
# 1. Detect runtime operation
if Builder.is_op_runtime(current_op.type):

    # 2. Route to appropriate handler
    if current_op.type == "gather_runtime":
        # Gather on N dimension (axis=0)
        success = allocate_gather_runtime(allocation, current_op)

    elif current_op.type == "slice_runtime":
        # Read ONNX attribute for N dimension start
        start_n = get_attribute(current_op, "starts")[0]
        # Calculate byte offset: start_n × H × W × padded_C × sizeof(dtype)
        success = allocate_slice_runtime(allocation, current_op, start_n)

    elif current_op.type == "concat_runtime":
        # Concatenate on N dimension (axis=0)
        # All inputs must have same H, W, C
        success = allocate_concat_runtime(allocation, current_op)

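    elif current_op.type == "split_runtime":
        # Split on N dimension (axis=0)
        # Read split attribute: [N_0, N_1, ..., N_k]
        split_sizes = get_attribute(current_op, "split")
        success = allocate_split_runtime(allocation, current_op, split_sizes)

    else:
        # Unknown runtime op: fail loudly rather than leave `success` unbound
        raise ValueError(f"Unhandled runtime operation: {current_op.type}")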
```

### Error Handling

Common issues and solutions:

| Issue                          | Cause                                  | Solution                                       |
|--------------------------------|----------------------------------------|------------------------------------------------|
| **Size mismatch in split**     | Split values don't sum to input N      | Validate `∑(split[i]) == N_input`              |
| **Non-contiguous concat**      | Previous allocations fragmented memory | Pre-reserve contiguous region                  |
| **Invalid offset in slice**    | `starts[0]` exceeds input N dimension  | Validate `starts[0] ≤ N_input`                 |
| **Mismatched H,W,C in concat** | Inputs have different H, W, or C       | Verify all inputs have same H,W,C              |
| **C-padding miscalculation**   | Forgot to use padded C for offsets     | Always use `round_up(C, 64)` for physical size |
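
The checks in the table can be sketched as small validation helpers. The function names are illustrative, and the slice check assumes `starts[0]` has already been normalized to a non-negative index.

```python
def round_up(value: int, multiple: int = 64) -> int:
    """Round value up to the nearest multiple (C is padded to 64 here)."""
    return -(-value // multiple) * multiple


def check_split(split: list[int], n_input: int) -> None:
    """Reject split sizes that do not cover the input N dimension exactly."""
    if sum(split) != n_input:
        raise ValueError(f"split sums to {sum(split)}, expected {n_input}")


def check_concat_inputs(shapes: list[tuple[int, int, int, int]]) -> None:
    """Reject concat inputs whose H, W, C differ."""
    trailing = {shape[1:] for shape in shapes}
    if len(trailing) > 1:
        raise ValueError(f"concat inputs disagree on H, W, C: {sorted(trailing)}")


def slice_byte_offset(start_n: int, shape: tuple[int, int, int, int], itemsize: int) -> int:
    """Byte offset of an axis-0 slice, computed with the padded C dimension."""
    n, h, w, c = shape
    if not 0 <= start_n <= n:
        raise ValueError(f"starts[0]={start_n} is outside [0, {n}]")
    return start_n * h * w * round_up(c) * itemsize


check_split([40, 100, 60], 200)
check_concat_inputs([(40, 16, 16, 64), (100, 16, 16, 64)])
print(slice_byte_offset(2, (8, 4, 4, 3), 4))  # 8192, not 384: C=3 is padded to 64
```

The last line illustrates the C-padding row: using the logical C=3 would give an offset of 384 bytes, while the physical layout places the slice at 8192 bytes.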

---

## Assumptions and Limitations

### Assumptions

1. **NHWC data format**: All tensors use NHWC layout (batch, height, width, channels)
2. **C-dimension padding**: C dimension padded to multiples of 64 for performance
3. **Effective axis-0 operations**: Aliasing assumes the operation axis is 0 (N/batch dimension), or that every dimension before the axis is unity (1), which gives the same memory layout as axis=0
4. **ONNX attributes are valid**: Slice offsets, split sizes are correct
5. **Type consistency**: Input and output dtypes match (for offset calculations)
6. **Contiguity available**: Sufficient contiguous memory for concat/split
7. **No intermediate writes**: No other operation writes to a buffer while an alias of it is live, so runtime ops can safely reuse input buffers
8. **Same H,W,C for concat/split**: All inputs/outputs in concat/split have identical H, W, C dimensions
9. **Gather output size constraint**: For `gather_runtime`, output size must be ≤ input size

### Limitations

1. **Leading dimensions constraint**: Operations on axis=k require all dimensions before k to be unity (1) for memory aliasing
2. **Concat with many inputs** (>20): May require complex allocation strategy
3. **Cross-bin operations**: Runtime ops must stay within a single memory bin
4. **Graph complexity**: Deeply nested concat chains increase scheduling difficulty
5. **Gather with repetitions**: Cannot support gather operations where indices cause output size > input size
6. **Fixed C-padding**: All tensors must use the same C-dimension padding strategy (multiples of 64)
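
The leading dimensions constraint (limitation 1) reduces to a simple shape check. A minimal sketch, with `can_alias_on_axis` as a hypothetical helper name:

```python
def can_alias_on_axis(shape: tuple[int, ...], axis: int) -> bool:
    """True when every dimension before `axis` is 1, so the op behaves like axis=0."""
    rank = len(shape)
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    axis %= rank  # normalize negative axes
    return all(dim == 1 for dim in shape[:axis])


print(can_alias_on_axis((200, 16, 16, 64), 0))  # True: axis=0 has no leading dims
print(can_alias_on_axis((1, 1, 32, 64), 2))     # True: leading dims are all 1
print(can_alias_on_axis((2, 16, 16, 64), 1))    # False: N=2 breaks contiguity
```

When the check fails, a slice or split along that axis touches several disjoint regions of the buffer, so a single base-plus-offset alias cannot describe the output.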

### Future Enhancements

- **Relaxed leading dimensions constraint**: Explore optimizations for cases where leading dimensions are small but not unity
- **Relaxed gather constraint**: Support output > input by allocating max(input, output) if needed
- **Dynamic padding**: Support variable C-dimension padding based on hardware capabilities

---

## Summary

Runtime memory aliasing lets structural operations on NHWC tensors share buffers between inputs and outputs instead of allocating separate ones:

**Key Concepts:**

- **Data Format**: NHWC (batch, height, width, channels)
- **C-Dimension Padding**: Last dimension padded to multiples of 64
- **Axis Flexibility**: Runtime ops support any axis; aliasing applies when all dimensions before the axis are unity (1), which makes the operation equivalent to axis=0 in memory layout

**Supported Operations:**

- **`gather_runtime`**: Max-size buffer sharing for element gathering
  - Allocation: `max(input_physical_size, output_physical_size)`, which equals `input_physical_size` under the current output ≤ input constraint
  - Use case: Select subset of elements along an axis

- **`slice_runtime`**: Offset-based sub-buffer aliasing for slicing
  - Allocation: `input_physical_size` only
  - Output offset: `start_k × stride_k × sizeof(dtype)` where k is the normalized axis
  - Use case: Extract contiguous range along an axis

- **`concat_runtime`**: Contiguous input layout with output at base
  - Allocation: `∑(physical_size_i)` for all inputs
  - Constraint: All inputs have same dimensions except at concatenation axis
  - Use case: Combine multiple tensors along an axis

- **`split_runtime`**: Contiguous output layout from single input
  - Allocation: `physical_size_input`
  - Split attribute: specifies size along split axis for each output
  - Use case: Partition input into multiple outputs along an axis

**Benefits:**

Combined with noop aliasing, these strategies remove the separate buffers that runtime operations would otherwise need, while maintaining correctness and performance. They rely on one observation: when all leading dimensions are unity, an operation on any axis has the same memory layout as an operation on axis=0, so buffers can be reused with simple offset calculations.
