# Depth-wise Convolution (DWC) Tiler and Scheduler Specification

## 1. Overview
Depthwise convolution (DWC) processes each input channel independently with its own filter, whereas standard convolution mixes information across all input channels at once. DWC therefore has fewer parameters and a lower computational cost, which makes models faster and more efficient on mobile and edge devices. Standard convolution offers richer feature mixing but is computationally heavier. DWC is often combined with a pointwise convolution to form a depthwise separable convolution.
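The cost difference can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the example layer shape are assumptions, bias terms are ignored, and the depthwise case assumes a channel multiplier of 1.

```python
def standard_conv_cost(h_out, w_out, c_in, c_out, k):
    """Return (params, macs) for a standard k x k convolution without bias."""
    params = k * k * c_in * c_out      # every output channel sees every input channel
    macs = params * h_out * w_out      # one full filter bank per output pixel
    return params, macs


def depthwise_conv_cost(h_out, w_out, channels, k):
    """Return (params, macs) for a k x k depthwise convolution without bias."""
    params = k * k * channels          # one k x k filter per channel, no channel mixing
    macs = params * h_out * w_out
    return params, macs


# Hypothetical layer: 56x56 output, 64 channels in and out, 3x3 kernel.
std_params, std_macs = standard_conv_cost(56, 56, 64, 64, 3)
dw_params, dw_macs = depthwise_conv_cost(56, 56, 64, 3)
print(std_params, dw_params, std_params // dw_params)
```

For this shape the depthwise layer needs a factor of `c_in` fewer parameters and multiply-accumulates than the standard layer.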

![Standard convolution vs DepthWise convolution](images/Standard-convolution-operation-vs-depthwise-separable-convolution-operation.jpeg)

## 2. Scope
This spec outlines how DWC is mapped to the AIE4 architecture (tiler and scheduler) and lists the limitations of that mapping.

## 3. Limitations
- Constraints: Yos % 2 == 0, Xos % 4 == 0, Cos % 64 == 0
- Kernel_gran: 64, 64 // Co_gran, Ci_gran
- Memory_align: 128
- L1 allocation constraints:
  - IFM, IFM_PONG and TDM cannot be allocated in the same bank
  - WGT and WGT_PONG cannot be allocated in the same bank
  - OFM and OFM_PONG cannot be allocated in the same bank
- IFM, WGT and OFM are ALWAYS streaming
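The limitations above can be expressed as simple checks. This is a minimal sketch, not the project's validation code: the helper names are hypothetical, Memory_align is assumed to be in bytes, and the bank rule is assumed to mean that the buffers in each group must occupy pairwise different banks.

```python
MEMORY_ALIGN = 128  # assumed to be bytes

# Buffer groups that must not share an L1 bank (from the list above).
BANK_EXCLUSION_GROUPS = [
    ("IFM", "IFM_PONG", "TDM"),
    ("WGT", "WGT_PONG"),
    ("OFM", "OFM_PONG"),
]


def align_up(size, align=MEMORY_ALIGN):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


def is_valid_ofm_subv(yos, xos, cos):
    """Check the OFM subvolume shape constraints."""
    return (yos > 0 and xos > 0 and cos > 0
            and yos % 2 == 0 and xos % 4 == 0 and cos % 64 == 0)


def banks_ok(placement):
    """placement maps buffer name -> bank index (hypothetical representation)."""
    for group in BANK_EXCLUSION_GROUPS:
        banks = [placement[name] for name in group if name in placement]
        if len(banks) != len(set(banks)):
            return False  # two buffers of the same group share a bank
    return True


print(is_valid_ofm_subv(2, 4, 64), align_up(576))
print(banks_ok({"IFM": 0, "IFM_PONG": 1, "TDM": 2, "WGT": 0, "WGT_PONG": 1}))
```

Buffers from different groups may share a bank in this sketch, since the list only restricts buffers within a group.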

## 4. Build flow
![AIE4 op build flow](images/aie4_op_build_flow.jpg)

## 5. Tiler
- Granularity: (2, 4, 64) // Xos, Yos, Cos
- Ci_loop is ALWAYS 1
- L1 buffer ping-pong priority (high to low):
  - IFM
  - WGT 
  - OFM
  - TDM
  - VEC
  - QDQ
- Tiling sorting priority (high to low):
  - is_input_single_buffered   
  - total_loop_count
  - ofm_subv_total           
  - Yos                      
  - Xos                    
  - Cos
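A multi-level priority like the one above maps naturally onto a tuple sort key. The sketch below shows the idea only. The `Tiling` record is hypothetical, and the direction of each criterion is an assumption that the list does not state: double-buffered input first, fewer loops first, then larger subvolumes first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tiling:
    """Hypothetical candidate record; field names follow the priority list."""
    is_input_single_buffered: bool
    total_loop_count: int
    ofm_subv_total: int
    Yos: int
    Xos: int
    Cos: int


def sort_key(t):
    # Tuples compare element by element, so earlier fields have higher priority.
    # Directions are assumptions: False < True, fewer loops, larger sizes.
    return (t.is_input_single_buffered, t.total_loop_count,
            -t.ofm_subv_total, -t.Yos, -t.Xos, -t.Cos)


candidates = [
    Tiling(True, 4, 2048, 4, 8, 64),
    Tiling(False, 8, 1024, 2, 8, 64),
    Tiling(False, 8, 2048, 4, 8, 64),
]
ranked = sorted(candidates, key=sort_key)
print(ranked[0])
```

With these assumed directions, the single-buffered candidate ranks last even though it has the lowest loop count, because buffering is the highest-priority criterion.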

### 5.1 Tiling Algorithm
1. Generate all valid ofm_subvs combinations respecting granularity
2. For each ofm_subv:
   - Calculate ifm_subv using dwc_get_aligned_Xis
   - Compute buffer sizes (IFM, WGT, OFM, QDQ)
   - Attempt L1 allocation with CP-SAT solver
3. Filter splits using dwc_is_split_valid
4. Force Ci_loop = 1 (no accumulation in DWC)
5. Sort by performance metrics
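The steps above can be sketched end to end in simplified form. This is not the actual tiler. `ifm_extent` is a hypothetical stand-in for dwc_get_aligned_Xis that uses the usual receptive-field formula with no padding or dilation. A plain byte-budget check replaces both the CP-SAT bank allocation and dwc_is_split_valid. The 64 KB budget, one byte per element, the ping-pong copies of IFM, WGT and OFM, and the shape constraints taken from Section 3 are all assumptions. QDQ buffers are omitted.

```python
from itertools import product

L1_BUDGET = 64 * 1024  # bytes; assumed from the "~64KB" L1 figure
ALIGN = 128            # assumed to be bytes


def align_up(n, a=ALIGN):
    return (n + a - 1) // a * a


def ifm_extent(out_extent, kernel, stride):
    """Input extent needed to produce out_extent outputs (no padding, no dilation)."""
    return (out_extent - 1) * stride + kernel


def enumerate_tilings(Yo, Xo, Co, Ky, Kx, stride=1, bytes_per_elem=1):
    """Enumerate OFM subvolumes that respect granularity and fit the L1 budget."""
    tilings = []
    shapes = product(range(2, Yo + 1, 2), range(4, Xo + 1, 4), range(64, Co + 1, 64))
    for Yos, Xos, Cos in shapes:
        Yis = ifm_extent(Yos, Ky, stride)
        Xis = ifm_extent(Xos, Kx, stride)
        ifm = align_up(Yis * Xis * Cos * bytes_per_elem)
        wgt = align_up(Ky * Kx * Cos * bytes_per_elem)  # one Ky x Kx filter per channel
        ofm = align_up(Yos * Xos * Cos * bytes_per_elem)
        l1_bytes = 2 * (ifm + wgt + ofm)  # ping and pong copies of each buffer
        if l1_bytes <= L1_BUDGET:
            tilings.append({"Yos": Yos, "Xos": Xos, "Cos": Cos,
                            "Ci_loop": 1,  # DWC never accumulates over Ci
                            "l1_bytes": l1_bytes})
    return tilings


tilings = enumerate_tilings(Yo=8, Xo=8, Co=64, Ky=3, Kx=3)
print(len(tilings), max(t["l1_bytes"] for t in tilings))
```

The surviving candidates would then be ranked by the sorting priority in Section 5.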

## 6. Scheduler
```
L3 (DDR) ──────> L2 (MemTile 3MB) ──────> L1 (Core ~64KB)
    │                    │                       │
    │                    │                       │
  Shim DMA          Memtile DMA             Core DMA
```
- The L2 allocator ALWAYS enables double buffering for IFM, WGT and OFM
- Supported data stream modes:
  - IFM unicast / WGT broadcast mode
  - IFM broadcast / WGT unicast mode
- **TBD** Include diagrams or link to architecture docs.
- **TBD** Loop and phasing logic 
#### L2 Schedule
**Not implemented yet**
#### L3 Schedule

### 6.1 Scheduling Algorithm
- Create DwcL2MemoryAllocator with double-buffering
- Determine data stream mode (unicast/broadcast) 
  - Either broadcast same IFM to cores processing different Co blocks
  - Or broadcast same WGT to cores processing different spatial tiles
- Generate DMA transfers for L2/L3
- Generate core instructions per tile



## 7. Run-time/WGT formatting requirements
Endpoints, data formats, etc.

