## Traineeships in Advanced Computing for High Energy Physics (TAC-HEP)

#### FPGA module training

Week-9

Lecture-16: 25/03/2025





### Content



#### So far

- HLS Pragmas:
  - Interface
  - Array Partition
  - Array reshape
  - Pipeline

### **Today**

- HLS Pragmas:
  - Dataflow
  - Latency
  - Allocation



### #pragma HLS Pipeline

https://docs.amd.com/r/en-US/ug1399-vitis-hls/pragma-HLS-pipeline

### Pragma HLS Pipeline



#### #pragma HLS pipeline II=<int>

- The PIPELINE pragma reduces the initiation interval (II) for a function or loop by allowing the concurrent execution of operations
- A pipelined function or loop can process new inputs every <N> clock cycles, where <N> is the II
- If HLS can't create a design with the specified II, it issues a warning and creates a design with the lowest possible II
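
As a minimal sketch of the bullets above, the loop below requests an II of 1, so a new iteration would start every clock cycle if the tool can schedule it. The function name `scale_add`, the array length `N = 8`, and the arithmetic are assumptions made for illustration, not code from the lecture. A standard C++ compiler ignores the HLS pragma, so the function can be checked in plain C simulation.

```cpp
#include <cstddef>

// Hypothetical array length, chosen only for this illustration.
constexpr std::size_t N = 8;

// Each iteration reads one element, computes, and writes one result.
// II=1 asks the tool to accept a new iteration every clock cycle;
// if that is not achievable, HLS warns and uses the lowest II it can reach.
void scale_add(const int in[N], int gain, int offset, int out[N]) {
    for (std::size_t i = 0; i < N; i++) {
#pragma HLS pipeline II=1
        int x = in[i];              // op_read
        int y = gain * x + offset;  // op_compute
        out[i] = y;                 // op_write
    }
}
```

The achieved II should be read from the synthesis report rather than assumed from the pragma.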



Without Loop pipelining

```
void func(input, output){
...
  for(i=0; i<N; i++){
#pragma HLS pipeline II=2
    op_read;
    op_compute;
    op_write;
  }
...
}
```



With Loop pipelining



https://docs.amd.com/r/en-US/ug1399-vitis-hls/pragma-HLS-dataflow



#### #pragma HLS dataflow

- Enables task-level pipelining: allow functions and loops to overlap in their operation
  - Increases the concurrency of the RTL implementation & thus the overall throughput of the design
- In the absence of any directives that limit resources (like pragma HLS allocation), HLS seeks to minimize latency & improve concurrency
  - Data dependencies can limit this, hence proper dataflow is needed



Without DATAFLOW pipelining

```
void top(a, b, c, d){
    ...
    func_A(a,b,i1);
    func_B(c,i1,i2);
    func_C(i2,d);
    ...
}
```



With DATAFLOW pipelining



#### **Example**:

- Without DATAFLOW, a function or loop that accesses arrays must finish all read/write accesses to those arrays before it completes
- This prevents the next function or loop that consumes the data from starting operation
- The DATAFLOW optimization enables the operations in a function or loop to start before the previous function or loop has completed all its operations
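
The three-function chain from the `top` pseudocode can be sketched as runnable C++. The function bodies, the element type, and `N = 16` are assumptions made for illustration; only the call structure (`func_A` feeds `func_B`, which feeds `func_C`) follows the slide. Each intermediate array has exactly one producer and one consumer, which is the pattern DATAFLOW expects.

```cpp
// Hypothetical array length for this illustration.
constexpr int N = 16;

// Producer: writes i1 once, in order.
static void func_A(const int a[N], int b, int i1[N]) {
    for (int i = 0; i < N; i++) i1[i] = a[i] + b;
}

// Middle task: reads i1 once, writes i2 once.
static void func_B(int c, const int i1[N], int i2[N]) {
    for (int i = 0; i < N; i++) i2[i] = c * i1[i];
}

// Consumer: reads i2 once, writes the output.
static void func_C(const int i2[N], int d[N]) {
    for (int i = 0; i < N; i++) d[i] = i2[i] - 1;
}

void top(const int a[N], int b, int c, int d[N]) {
#pragma HLS dataflow
    int i1[N];  // written only by func_A, read only by func_B
    int i2[N];  // written only by func_B, read only by func_C
    func_A(a, b, i1);
    func_B(c, i1, i2);
    func_C(i2, d);
}
```

In software the three calls still run one after another; the overlap only exists in the synthesized hardware, so C simulation checks the result, not the concurrency.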



Coding styles that prevent the DATAFLOW optimization:

- Bypassing tasks
- Feedback between tasks
- Conditional execution of tasks
- Loops with multiple exit conditions

### Pragma HLS Dataflow - Example Task-level pipeline



#### #pragma HLS dataflow



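- For the cases listed above, the HLS tool issues a message and does not perform the DATAFLOW optimization
- Use the STABLE pragma to mark variables within DATAFLOW regions as stable, which avoids concurrent reads or writes of those variables
- No hierarchical implementation: a sub-function or loop that should also use DATAFLOW needs the pragma applied to it separately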
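
One of the blocked patterns, bypassing tasks, can often be removed by forwarding the bypassing data through the middle task. The sketch below assumes three loop tasks and hypothetical names (`top`, `temp1` to `temp4`, `N = 8`); it illustrates the idea and is not code from the lecture. Without `temp4`, the third loop would read `temp2` directly from the first loop and skip the second one.

```cpp
// Hypothetical array length for this illustration.
constexpr int N = 8;

void top(const int data_in[N], int scale, int data_out[N]) {
#pragma HLS dataflow
    int temp1[N], temp2[N], temp3[N], temp4[N];

    // Task 1: produces temp1 and temp2.
    for (int i = 0; i < N; i++) {
        temp1[i] = data_in[i] * scale;
        temp2[i] = data_in[i] >> scale;
    }

    // Task 2: consumes temp1 and temp2, and forwards temp2 as temp4
    // so that no data skips this stage.
    for (int j = 0; j < N; j++) {
        temp3[j] = temp1[j] + 123;
        temp4[j] = temp2[j];
    }

    // Task 3: consumes only what task 2 produced.
    for (int k = 0; k < N; k++) {
        data_out[k] = temp4[k] + temp3[k];
    }
}
```

The extra copy costs some storage, but every array again has a single producer and a single consumer.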


TAC-HEP: FPGA training module - Varun Sharma

### Pragma HLS Dataflow - Example



```
#include "example.h"  // assumed to define N and declare squared() and func()

void example (
  unsigned int in[N],
  short a,
  short b,
  unsigned int c,
  unsigned int out[N]
  ) {
  unsigned int x, y;
  unsigned int tmp1, tmp2, tmp3;

  for_Loop: for (unsigned int i=0; i < N; i++) {
    x = in[i];
    tmp1 = func(1, 2);
    tmp2 = func(2, 3);
    tmp3 = func(1, 4);
    y = a*x + b + squared(c) + tmp1 + tmp2 + tmp3;
    out[i] = y;
  }
}

// Returns a squared
unsigned int squared(unsigned int a) {
  unsigned int res = 0;
  res = a*a;
  return res;
}

// Returns a*a*a*b + 3
unsigned int func(short a, short b) {
  unsigned int res;
  res = a*a;
  res = res*b*a;
  res = res + 3;
  return res;
}
```

#pragma HLS dataflow

```
#include "example.h"  // assumed to define N and declare squared() and func()

void example (
  unsigned int in[N],
  short a,
  short b,
  unsigned int c,
  unsigned int out[N]
  ) {
  unsigned int x, y;
  unsigned int tmp1, tmp2, tmp3;
#pragma HLS dataflow
  for_Loop: for (unsigned int i=0; i < N; i++) {
#pragma HLS Pipeline
    x = in[i];
    tmp1 = func(1, 2);
    tmp2 = func(2, 3);
    tmp3 = func(1, 4);
    y = a*x + b + squared(c) + tmp1 + tmp2 + tmp3;
    out[i] = y;
  }
}

// Returns a squared
unsigned int squared(unsigned int a) {
  unsigned int res = 0;
  res = a*a;
  return res;
}

// Returns a*a*a*b + 3
unsigned int func(short a, short b) {
  unsigned int res;
  res = a*a;
  res = res*b*a;
  res = res + 3;
  return res;
}
```



#### #pragma HLS dataflow

#### Without DATAFLOW pipelining

Timing:

\* Summary:

| Clock  | Target | Estimated | Uncertainty |
|--------|--------|-----------|-------------|
| ap_clk |        |           | 3.12 ns     |

Latency:

\* Summary:

| Latency min (cycles) | Latency max (cycles) | Latency min (absolute) | Latency max (absolute) | Interval min | Interval max | Pipeline Type |
|----------------------|----------------------|------------------------|------------------------|--------------|--------------|---------------|
| 121                  | 121                  | 3.025 us               | 3.025 us               | 121          | 121          | none          |

#### With DATAFLOW pipelining



#### Latency:

\* Summary:

| Latency min (cycles) | Latency max (cycles) | Latency (absolute) |
|----------------------|----------------------|--------------------|
| 62                   | 62                   | 1.550 us           |



#### #pragma HLS dataflow

#### Without DATAFLOW pipelining

Utilization Estimates

\* Summary:

| Name                | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|---------------------|----------|--------|---------|---------|------|
| DSP                 | -        | -      | -       | -       | -    |
| Expression          | -        | 5      | 0       | 154     | -    |
| FIFO                | -        | -      | -       | -       | -    |
| Instance            | -        | -      | -       | -       | -    |
| Memory              | -        | -      | -       | -       | -    |
| Multiplexer         | -        | -      | -       | 30      | -    |
| Register            | -        | -      | 117     | -       | -    |
| Total               | 0        | 5      | 117     | 184     | 0    |
| Available SLR       | 1440     | 2280   | 788160  | 394080  | 320  |
| Utilization SLR (%) | 0        | ~0     | ~0      | ~0      | 0    |
| Available           | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Utilization (%)     | 0        | ~0     | ~0      | ~0      | 0    |

#### With DATAFLOW pipelining

Utilization Estimates

\* Summary:

| Name                | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|---------------------|----------|--------|---------|---------|------|
| DSP                 | -        | -      | -       | -       | -    |
| Expression          | -        | -      | -       | -       | -    |
| FIFO                | -        | -      | -       | -       | -    |
| Instance            | -        | 5      | 115     | 214     | -    |
| Memory              | -        | -      | -       | -       | -    |
| Multiplexer         | -        | -      | -       | -       | -    |
| Register            | -        | -      | -       | -       | -    |
| Total               | 0        | 5      | 115     | 214     | 0    |
| Available SLR       | 1440     | 2280   | 788160  | 394080  | 320  |
| Utilization SLR (%) | 0        | ~0     | ~0      | ~0      | 0    |
| Available           | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Utilization (%)     | 0        | ~0     | ~0      | ~0      | 0    |

### Pragma HLS allocation

- Specifies instance restrictions to limit resource allocation in the implemented kernel
- Defines, and can limit, the number of RTL instances and hardware resources used to implement specific functions, loops, operations, or cores
- Example: the C source code has 4 instances of a function my\_func
  - The ALLOCATION pragma can ensure that there is only one instance of my\_func in the RTL
  - All 4 instances are implemented using the same RTL block
    - Reduces the resources used by the function but may impact performance
- Operations such as additions, multiplications, array reads, and writes can also be limited by the ALLOCATION pragma

### Pragma HLS allocation - Syntax

Kernel Optimization



#### #pragma HLS allocation instances=<list> limit=<value> <type>

- instances=<list>\*: names of the functions, operators, or cores
- limit=<value>\*: specifies the limit on the number of instances to be used in the kernel
- <type>\*: specifies whether the allocation applies to a function, an operation, or a core (hardware component) used to create the design (such as an adder, multiplier, BRAM)
  - function: allocation applies to the functions listed in instances=
  - operation: allocation applies to the operations listed in instances=
  - core: allocation applies to the cores listed in instances=

### Pragma HLS allocation - Example

Kernel Optimization



#### #pragma HLS allocation instances=<list> limit=<value> <type>

Example 1: Limits the number of instances of my\_func in the RTL for hardware kernel to 1

```
void top(a, b, c, d) {
#pragma HLS ALLOCATION instances=my_func limit=1 function
    ...
    my_func(a,b); //my_func_1
    my_func(a,c); //my_func_2
    my_func(a,d); //my_func_3
    ...
}
```

Example 2: Limits the number of multiplier operations used in the implementation of the function my\_func to 1

- The limit does NOT apply outside the function
- Alternatively, inlining the sub-function can do a similar job

```
void my_func(data_t angle) {
#pragma HLS allocation instances=mul limit=1 operation
...
}
```
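
A self-contained variant of Example 2 is sketched below. The function name `my_poly`, its arguments, and the `data_t` typedef are hypothetical and chosen only so that the body contains several multiplications; the pragma line follows the syntax shown on the slide. With `limit=1` the three multiplications would be expected to share one multiplier, usually at the cost of extra latency.

```cpp
// Assumption: the lecture does not define data_t, so a plain int is used here.
typedef int data_t;

// Evaluates c0 + c1*x + c2*x*x using three multiplications.
data_t my_poly(data_t x, data_t c0, data_t c1, data_t c2) {
#pragma HLS allocation instances=mul limit=1 operation
    data_t x2 = x * x;         // multiplication 1
    data_t linear = c1 * x;    // multiplication 2
    data_t quadratic = c2 * x2; // multiplication 3
    return c0 + linear + quadratic;
}
```

Comparing the DSP48E count and the latency in the reports with and without the pragma shows the trade-off for a given design.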

### Example



#### #pragma HLS allocation instances=<list> limit=<value> <type>

```
#include "example.h"  // assumed to define N and declare squared() and func()

void example (
  unsigned int in[N],
  short a,
  short b,
  unsigned int c,
  unsigned int out[N]
  ) {
  unsigned int x, y;
  unsigned int tmp1, tmp2, tmp3;

  for_Loop: for (unsigned int i=0; i < N; i++) {
#pragma HLS allocation instances=func limit=1 function
    x = in[i];
    tmp1 = func(1, 2);
    tmp2 = func(2, 3);
    tmp3 = func(1, 4);
    y = a*x + b + squared(c) + tmp1 + tmp2 + tmp3;
    out[i] = y;
  }
}

// Returns a squared
unsigned int squared(unsigned int a) {
  unsigned int res = 0;
  res = a*a;
  return res;
}
```

### Pragma HLS allocation



#### Timing:

\* Summary:

| Clock | Target | Estimated | Uncertainty |
|-------|--------|-----------|-------------|
|       |        |           | 3.12 ns     |

#### Latency:

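\* Summary:

| Latency min (cycles) | Latency max (cycles) | Latency min (absolute) | Latency max (absolute) | Interval min | Interval max | Pipeline Type |
|----------------------|----------------------|------------------------|------------------------|--------------|--------------|---------------|
| 121                  |                      |                        | 3.025 us               |              |              |               |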

Utilization Estimates

\* Summary:

| Name                | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|---------------------|----------|--------|---------|---------|------|
| DSP                 | -        | -      | -       | -       | -    |
| Expression          | -        | 5      | 0       | 169     | -    |
| FIFO                | -        | -      | -       | -       | -    |
| Instance            | -        | -      | -       | -       | -    |
| Memory              | -        | -      | -       | -       | -    |
| Multiplexer         | -        | -      | -       | 30      | -    |
| Register            | -        | -      | 85      | -       | -    |
| Total               | 0        | 5      | 85      | 199     | 0    |
| Available SLR       | 1440     | 2280   | 788160  | 394080  | 320  |
| Utilization SLR (%) | 0        | ~0     | ~0      | ~0      | 0    |
| Available           | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Utilization (%)     | 0        | ~0     | ~0      | ~0      | 0    |

### Pragma HLS Latency



#### #pragma HLS latency min=<int> max=<int>

- Specifies a minimum or maximum latency value, or both, for the completion of functions, loops, and regions
  - min=<int>: minimum latency for the function, loop, or region of code
  - max=<int>: maximum latency for the function, loop, or region of code
- Latency: # of CLK cycles required to produce an output
- Function latency: # of CLK cycles required for the function to compute all output values and return
- Loop latency: # of CLK cycles to execute all iterations of the loop

### Pragma HLS Latency



#### #pragma HLS latency min=<int> max=<int>

- HLS always tries to minimize latency in the design
- When the LATENCY pragma is specified:
  - min < latency < max: the constraint is satisfied and no further optimization is performed
  - latency < min: HLS extends the latency to the specified value, potentially increasing resource sharing
  - latency > max: HLS increases effort to achieve the constraint
    - If still unsuccessful, it issues a warning and produces a design with the smallest achievable latency in excess of the maximum

### Pragma HLS Latency - Example

**Kernel Optimization** 



#### #pragma HLS latency min=<int> max=<int>

**Example-1:** Function foo is specified to have a minimum latency of 4 and a maximum latency of 8

**Example-2:** loop\_1 is specified to have a maximum latency of 12

**Example-3:** Creates a code region and groups signals that need to change in the same clock cycle by specifying zero latency

```
int foo(char x, char a, char b, char c) {
  #pragma HLS latency min=4 max=8
  char y;
  y = x*a+b+c;
  return y;
}
```

```
void foo (num_samples, ...) {
  int i;
  ...
  loop_1: for(i=0;i< num_samples;i++) {
  #pragma HLS latency max=12
   ...
  result = a + b;
  }
}
```

```
// create a region { } with a latency = 0
{
    #pragma HLS LATENCY max=0 min=0
    *data = 0xFF;
    *data_vld = 1;
}
```

### Pragma HLS Latency - Example



```
void example (
  unsigned int in[N],
  short a,
  short b,
  unsigned int c,
  unsigned int out[N]
  ) {
  unsigned int x, y;
  unsigned int tmp1, tmp2, tmp3;

  for_Loop: for (unsigned int i=0; i < N; i++) {
#pragma HLS latency min=4
    x = in[i];
    tmp1 = func(1, 2);
    tmp2 = func(2, 3);
    tmp3 = func(1, 4);
    y = a*x + b + squared(c) + tmp1 + tmp2 + tmp3;
    out[i] = y;
  }
}
```

### Pragma HLS Latency - Results



#### Timing:

\* Summary:

| Clock  | Target   | Estimated | Uncertainty |
|--------|----------|-----------|------------------|
| ap_clk | 25.00 ns | 7.401 ns  | 3.12 ns          |

#### Latency:

\* Summary:

| Latency min (cycles) | Latency max (cycles) | Latency min (absolute) | Latency max (absolute) | Interval min | Interval max | Pipeline Type |
|----------------------|----------------------|------------------------|------------------------|--------------|--------------|---------------|
| 301                  | 301                  | 7.525 us               | 7.525 us               | 301          | 301          | none          |

+ Detail:

\* Instance:

N/A

\* Loop:

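| Loop Name  | Latency min (cycles) | Latency max (cycles) | Iteration Latency | Initiation Interval achieved | Initiation Interval target | Trip Count | Pipelined |
|------------|----------------------|----------------------|-------------------|------------------------------|----------------------------|------------|-----------|
| - for_Loop | 300                  | 300                  | 5                 | -                            | -                          | 60         | no        |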
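
The numbers in this report can be cross-checked with a little arithmetic. The sketch below reads the 5 and 60 in the for_Loop row as iteration latency and trip count, and uses the 25.00 ns target clock from the timing table; the helper names are hypothetical. The one extra cycle between the loop latency (300) and the function latency (301) is taken from the report, not derived here.

```cpp
// Latency of a non-pipelined loop: each iteration finishes
// before the next one starts.
constexpr long loop_latency_cycles(long trip_count, long iteration_latency) {
    return trip_count * iteration_latency;
}

// Converts a cycle count to microseconds for a clock period given in ns.
constexpr double cycles_to_us(long cycles, double clock_period_ns) {
    return static_cast<double>(cycles) * clock_period_ns / 1000.0;
}
```

With these helpers, `loop_latency_cycles(60, 5)` gives the 300 cycles reported for for_Loop, and `cycles_to_us(301, 25.0)` reproduces the 7.525 us absolute latency.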

#### Utilization Estimates:

\* Summary:

| Name                | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|---------------------|----------|--------|---------|---------|------|
| DSP                 | -        | -      | -       | -       | -    |
| Expression          | -        | 5      | 0       | 154     | -    |
| FIFO                | -        | -      | -       | -       | -    |
| Instance            | -        | -      | -       | -       | -    |
| Memory              | -        | -      | -       | -       | -    |
| Multiplexer         | -        | -      | -       | 47      | -    |
| Register            | -        | -      | 120     | -       | -    |
| Total               | 0        | 5      | 120     | 201     | 0    |
| Available SLR       | 1440     | 2280   | 788160  | 394080  | 320  |
| Utilization SLR (%) | 0        | ~0     | ~0      | ~0      | 0    |
| Available           | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Utilization (%)     | 0        | ~0     | ~0      | ~0      | 0    |



### Reminder: Assignments



- Assignment-1 (13-02-2025)
- Assignment-2 (18-02-2025)
- Assignment-3 (27-02-2025)
- Assignment-4 (18-03-2025)
- Assignment-5 (18-03-2025)

Uploaded to cernbox: https://cernbox.cern.ch/s/gmUqRDHTxDLqx4M

Send via email: varun.sharma@cern.ch

Submit within 2 weeks of the assignment date



# **TAC-HEP 2025**

### Questions?

#### Acknowledgements:

- https://docs.amd.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas
- ug871-vivado-high-level-synthesis-tutorial.pdf

### List of Available Pragmas



| Type                 | Attributes |
|----------------------|------------|
| Kernel Optimization  | pragma HLS aggregate, pragma HLS disaggregate, pragma HLS expression_balance, pragma HLS latency, pragma HLS performance, pragma HLS protocol, pragma HLS reset, pragma HLS top, pragma HLS stable |
| Function Inlining    | pragma HLS inline |
| Interface Synthesis  | pragma HLS interface, pragma HLS stream |
| Task-level Pipeline  | pragma HLS dataflow, pragma HLS stream |
| Pipeline             | pragma HLS pipeline, pragma HLS occurrence |
| Loop Unrolling       | pragma HLS unroll, pragma HLS dependence |
| Loop Optimization    | pragma HLS loop_flatten, pragma HLS loop_merge, pragma HLS loop_tripcount |
| Array Optimization   | pragma HLS array_partition, pragma HLS array_reshape |
| Structure Packing    | pragma HLS aggregate, pragma HLS dataflow |
| Resource Utilization | pragma HLS allocation, pragma HLS bind_op, pragma HLS bind_storage, pragma HLS function_instantiate |

### Reminder: HLS Setup



- ssh <username>@cmstrigger02-via-login -L5901:localhost:5901
  - Replace 5901 with the port for your own display number (display :1 corresponds to port 5901)
- Sometimes you may need to run vncserver -localhost -geometry 1024x768 again to start a new VNC server
- Connect to the VNC server with a remote desktop client
- Open a terminal
  - For Vivado HLS:
    - source /opt/Xilinx/Vivado/2020.1/settings64.sh
    - cd /scratch/`whoami`
    - vivado\_hls
  - For Vitis HLS:
    - source /opt/Xilinx/Vitis/2020.1/settings64.sh
    - cd /scratch/`whoami`
    - vitis\_hls



### Jargon



- **IC** Integrated Circuit: assembly of hundreds of millions of transistors on a small chip
- **PCB** Printed Circuit Board
- **LUT** Look Up Table, aka 'logic': generic functions on small bitwidth inputs. Combine many to build the algorithm
- **FF** Flip Flops: control the flow of data with the clock pulse. Used to build the pipeline and achieve high throughput
- **DSP** Digital Signal Processor: performs multiplication and other arithmetic in the FPGA
- **BRAM** Block RAM: hardened RAM resource. More efficient memories than using LUTs for more than a few elements
- **PCIe** or **PCI-E** Peripheral Component Interconnect Express: a serial expansion bus standard for connecting a computer to one or more peripheral devices
- **InfiniBand**: a computer networking communications standard used in high-performance computing that features very high throughput and very low latency
- **HLS** High Level Synthesis: compiler for C, C++, SystemC into FPGA IP cores
- **HDL** Hardware Description Language: low level language for describing circuits
- **RTL** Register Transfer Level: the very low level description of the function and connection of logic gates
- **FIFO** First In First Out memory
- **Latency**: time between starting processing and receiving the result
  - Measured in clock cycles or seconds
- **II** Initiation Interval: time from accepting one input to accepting the next input
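
The Latency and II definitions combine into a common rule of thumb for a pipelined loop: the first result appears after one iteration latency, and each further iteration adds II cycles. The helper below is a sketch of that estimate under the assumption of a constant II and no stalls; the function name and the example numbers are hypothetical, and real reports can differ by a few cycles of overhead.

```cpp
// Approximate cycle count of a pipelined loop:
// iteration_latency cycles for the first iteration,
// then one new iteration every `ii` cycles.
constexpr long pipelined_cycles(long trip_count, long iteration_latency, long ii) {
    return trip_count > 0 ? iteration_latency + ii * (trip_count - 1) : 0;
}
```

For example, 100 iterations with an iteration latency of 3 take about `pipelined_cycles(100, 3, 1)` = 102 cycles at II=1 and 201 cycles at II=2, compared with 100 × 3 = 300 cycles when the loop is not pipelined.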

