## Traineeships in Advanced Computing for High Energy Physics (TAC-HEP)

#### **FPGA module training**

<u>Week-10</u>

Lecture-19: 10/04/2025



UNIVERSITY OF WISCONSIN-MADISON

Varun Sharma

University of Wisconsin – Madison, USA



#### <u>So far</u>

- HLS Pragmas:
  - Interface
  - Array Partition
  - Array reshape
  - Pipeline
  - Dataflow
  - Latency
  - Allocation
  - Stable
  - Inline
  - Unroll

### <u>Today</u>

- Unsupported C/C++ constructs
- Matrix Multiplication
- Examples for HLS Pragmas:
  - Unroll

## TAC-HEP 202

## Different IPs on VU13P







TAC-HEP: FPGA training module - Varun Sharma



# Unsupported C/C++ Constructs






#### HLS compilers support many C/C++ constructs, but some are not synthesizable.

• Coding changes may be required for successful synthesis and implementation.

#### For a function to be synthesized:

- It must fully contain the design's functionality
- No system calls to the operating system are allowed
- All C/C++ constructs must have fixed or bounded sizes
- The constructs' implementation must be unambiguous





**System calls are not synthesizable** because they interact with the operating system, which is not present in the hardware environment where the synthesized design runs.

**Vitis HLS ignores certain system calls** like printf() and fprintf(stdout,) if they only display data and don't affect algorithm execution.

Most other system calls (e.g., getc(), time(), sleep()) are not synthesizable and should be removed before synthesis.

Vitis HLS defines the <u>\_\_SYNTHESIS\_\_</u> macro during synthesis.

• This macro can be used to conditionally exclude non-synthesizable code from the design.

```
#include <stdio.h>  // FILE, fopen, fprintf: used in C simulation only

void hier_func4(din_t A, din_t B, dout_t *C, dout_t *D)
{
    dint_t apb, amb;

    sumsub_func(&A, &B, &apb, &amb);
#ifndef __SYNTHESIS__
    FILE *fp1; // The following code is ignored for synthesis
    char filename[255];
    sprintf(filename, "Out_apb_%03d.dat", apb);
    fp1 = fopen(filename, "w");
    fprintf(fp1, "%d \n", apb);
    fclose(fp1);
#endif
    shift_func(&apb, &amb, C, D);
}
```

## Dynamic Memory Usage



- Memory allocation system calls like malloc(), alloc(), and free() rely on OS-managed resources and runtime behavior
- Such calls cannot be synthesized and must be removed from the design code
- A hardware design must be **fully self-contained**, with all required resources explicitly defined
- Dynamic memory operations must be replaced with equivalent fixed or bounded representations for synthesis

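Because these coding changes affect the functionality of the design, AMD does not recommend hiding them behind the <u>\_\_SYNTHESIS\_\_</u> macro. Use a user-defined macro instead (NO\_SYNTH in the example below), so that the code validated in C/C++ simulation is the same code that is synthesized.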



### Example



```
#include "malloc_removed.h"
#include <stdlib.h>
//#define NO_SYNTH

dout_t malloc_removed(din_t din[N], dsel_t width)
{
#ifdef NO_SYNTH
    // Original version: dynamic allocation, not synthesizable
    long long *out_accum = (long long *)malloc(sizeof(long long));
    int *array_local = (int *)malloc(64 * sizeof(int));
#else
    // Synthesizable version: fixed-size storage behind the same pointer names
    long long _out_accum;
    long long *out_accum = &_out_accum;
    int _array_local[64];
    int *array_local = &_array_local[0];
#endif
    int i, j;

    LOOP_SHIFT: for (i = 0; i < N-1; i++) {
        if (i < width)
            *(array_local+i) = din[i];
        else
            *(array_local+i) = din[i] >> 2;
    }

    *out_accum = 0;
    LOOP_ACCUM: for (j = 0; j < N-1; j++) {
        *out_accum += *(array_local+j);
    }

    return *out_accum;
}
```



1. Add the user-defined macro **NO\_SYNTH** to the code and modify the code.
2. Enable the macro **NO\_SYNTH**, execute the C/C++ simulation, and save the results.
3. Disable the macro **NO\_SYNTH**, and execute the C/C++ simulation to verify that the results are identical.
4. Perform synthesis with the user-defined macro disabled.

This methodology ensures that the updated code is validated with C/C++ simulation and that the identical code is then synthesized.



## **Pointer Limitation**



**✗** General pointer casting is not supported by Vitis HLS

    int num = 10;
    void *ptr = &num;           // void pointer pointing to an integer
    int *intPtr = (int *)ptr;   // cast the void pointer to an integer pointer

**✓** Pointer arrays (arrays of pointers) are supported

- **✓** Provided each pointer points to a scalar or an array of scalars
- **✗** Arrays of pointers can't point to additional pointers

**✗** Function pointers are not supported

    int (*funcPtr)(int, int);
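A common way around the function-pointer restriction is to select the operation with a plain value and dispatch with a `switch`, so that every call target is known at compile time. The sketch below only illustrates that idea; the names `op_t`, `apply`, `add`, `sub`, and `mul` are hypothetical and not part of the lecture code.

```cpp
// Hypothetical operation selector used instead of a function pointer
enum op_t { OP_ADD, OP_SUB, OP_MUL };

static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }
static int mul(int a, int b) { return a * b; }

// All possible callees are named explicitly, so nothing has to be
// resolved through a pointer at run time.
int apply(op_t op, int a, int b) {
    switch (op) {
        case OP_ADD: return add(a, b);
        case OP_SUB: return sub(a, b);
        case OP_MUL: return mul(a, b);
        default:     return 0;  // fallback for an invalid selector
    }
}
```

A caller would write `apply(OP_ADD, 2, 3)` where the pointer-based version would have written `funcPtr(2, 3)`.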





**✗** Recursive functions can't be synthesized (this applies to functions that can form multiple recursions)

```
unsigned foo (unsigned n)
{
    if (n == 0 || n == 1) return 1;
    return (foo(n-2) + foo(n-1));
}
```

**✗** Tail recursion is also not supported, even though it involves only a finite number of function calls

```
unsigned foo (unsigned m, unsigned n)
{
    if (m == 0) return n;
    if (n == 0) return m;
    return foo(n, m%n);
}
```
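A tail-recursive function can usually be rewritten as a loop that updates the arguments in place. The sketch below is one possible loop-based equivalent of the two-argument `foo` above; the name `foo_iter` is hypothetical.

```cpp
// Loop-based equivalent of the tail-recursive foo(m, n):
// each pass through the loop replaces one recursive call foo(n, m % n).
unsigned foo_iter(unsigned m, unsigned n) {
    while (m != 0 && n != 0) {
        unsigned r = m % n;  // second argument of the would-be recursive call
        m = n;               // first argument of the would-be recursive call
        n = r;
    }
    // Same base cases as the recursive version
    return (m == 0) ? n : m;
}
```

The loop still has a data-dependent trip count, so the tool cannot report a fixed latency for it; a `loop_tripcount` pragma can supply an estimate for the reports if needed.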



- Many C++ STL containers and algorithms contain function recursion and use dynamic memory allocation
- These can **NOT** be synthesized by Vitis HLS

#### • Solution:

• Create a local function with identical functionality that does not feature recursion, dynamic memory allocation, or dynamic creation and destruction of objects.

#### • Examples of affected STL components: std::vector, std::map, std::list, std::sort
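As an illustration of such a local replacement, the sketch below sorts a fixed-size array with a bubble sort whose loop bounds are compile-time constants, instead of calling `std::sort`. The size `N`, the type `data_t`, and the function name `sort_fixed` are assumptions made for this example only.

```cpp
constexpr int N = 8;   // hypothetical fixed array size
typedef int data_t;    // hypothetical element type

// Sorts a[] in ascending order without recursion or dynamic memory.
// Both loops have constant bounds, so the total work is known up front.
void sort_fixed(data_t a[N]) {
    for (int pass = 0; pass < N; ++pass) {
        for (int j = 0; j < N - 1; ++j) {
            if (a[j] > a[j + 1]) {
                // Swap neighbouring elements that are out of order
                data_t tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}
```

Bubble sort is chosen here for its fixed loop structure, not for speed; for larger arrays a sorting network or merge-based design would be a more typical hardware choice.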

## Undefined Behaviors



C/C++ undefined behavior is accepted by the tools but may lead to different behavior in simulation and synthesis.

    for (int i = 0; i < N; i++) {
        int val;                  // uninitialized value
        if (i == 0) val = 0;
        else if (cond) val = 1;
        // val may have an indeterminate value here
        A[i] = val;               // undefined behavior
        val++;                    // dead code
    }

GCC and Vitis HLS are likely to compile this code differently, which leads to a mismatch during C/RTL co-simulation.

- In GCC compiled for CPU, the value of **val** may be retained across loop iterations, as it could remain in the same register or stack location
- Good practice:
  - If every iteration should start from a known value, initialize **val** at the start of each iteration.
  - If the value is meant to carry over between iterations, move the declaration of **val** above the loop so that its lifetime matches the intended reuse.

Do not expect the compiler to infer a specific defined RTL behavior from undefined C/C++ behavior.
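The two good-practice options can be sketched as follows. The array size `N`, the per-element `cond` array, and the function names are hypothetical; the point is only that `val` always has a defined value before it is read.

```cpp
constexpr int N = 8;  // hypothetical array size

// Option 1: val starts from a known value in every iteration.
void fill_reset(int A[N], const bool cond[N]) {
    for (int i = 0; i < N; i++) {
        int val = 0;            // initialized on every iteration
        if (cond[i]) val = 1;
        A[i] = val;
    }
}

// Option 2: val is declared above the loop, so carrying its value
// from one iteration to the next is well defined.
void fill_carry(int A[N], const bool cond[N]) {
    int val = 0;                // lifetime spans the whole loop
    for (int i = 0; i < N; i++) {
        if (cond[i]) val = 1;
        A[i] = val;
    }
}
```

With `cond` true only at index 2, the first variant writes a single 1 at `A[2]`, while the second keeps writing 1 from index 2 onwards. Both results are defined, so simulation and synthesis should agree on either one.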

## Some common errors/warnings



WARNING: [RTGEN 206-101] Setting dangling out port 'example/A\_WEN\_A' to 0

WARNING: [RTGEN 206-101] Setting dangling out port 'example/A\_Din\_A' to 0

HLS generated write-enable (WEN) and data-in (Din) ports for array A, but the design never writes to A, so those outputs are dangling (unused) and are set to 0.

#pragma HLS INTERFACE ap\_const port=A
#pragma HLS INTERFACE ap\_const port=B

These ports are **read-only** and won't generate write ports (no WEN, Din).

ERROR: [XFORM 203-801] Interface parameter bitwidth 'A.V' (example.cpp:8:1) must be a multiple of 8 for AXI4 master port

AXI4 memory-mapped interfaces require data widths in whole bytes (multiples of 8 bits).

## Some common errors/warnings



WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('A\_load\_2', example.cpp:22) on array 'A' due to limited memory ports. Please consider using a memory core with more ports or partitioning the array 'A'.

This warning is very common in HLS. It appears when several accesses to the same memory (like arrays A, B, or C) must happen in the same clock cycle, but the default memory core only has one read and one write port.

- HLS maps arrays to **block RAMs** (usually single-port or dual-port).
- When you pipeline loops (like with #pragma HLS PIPELINE), multiple operations might try to read/write to the same array at once.
- Since **BRAM has limited ports**, the scheduler cannot fit all accesses into one cycle and issues this warning.

Try partitioning the array: this may get rid of the warning.
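A minimal sketch of the partitioning idea, under assumptions: the kernel `sum4`, the size `N`, and the type `din_t` are hypothetical, and the pragma uses the 2020.1-style syntax (newer Vitis HLS releases spell the type as `type=cyclic`, so check the syntax against your tool version). The loop reads four elements per iteration, which a single BRAM cannot serve in one cycle; a cyclic partition by 4 places elements `4*i` to `4*i+3` in four separate memories.

```cpp
constexpr int N = 16;  // hypothetical array size, a multiple of 4
typedef int din_t;     // hypothetical element type

// Adds each group of four neighbouring elements of A.
void sum4(const din_t A[N], din_t out[N / 4]) {
    din_t buf[N];
    // Split buf into 4 memories: index i goes to bank (i % 4)
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1

    // Copy the input into the partitioned local buffer
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = A[i];
    }

    // Four reads per iteration, each hitting a different bank
    for (int i = 0; i < N / 4; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = buf[4 * i] + buf[4 * i + 1] + buf[4 * i + 2] + buf[4 * i + 3];
    }
}
```

A regular C++ compiler ignores the pragmas, so the same function can be checked in C simulation before synthesis.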



## Matrix Multiplication


## Matrix multiplication

```
#include <ap_int.h>
#include <hls_stream.h>
#include <cstddef>  // size_t

#include "example.h"

// Top-level function for HLS
void example(din_t A[N][N], din_t B[N][N], din_t C[N][N]) {
#pragma HLS INTERFACE m_axi port=A
#pragma HLS INTERFACE m_axi port=B
#pragma HLS INTERFACE m_axi port=C

    // Matrix multiplication
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
            din_t sum = 0;
            for (size_t k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```







## **UNROLL Example**



```
#include <ap_int.h>
#include <hls_stream.h>
#include <cstddef>  // size_t

#include "example.h"

void example(din_t A[N], din_t B[N], din_t C[N]) {
    for (size_t i = 0; i < N; ++i) {
#pragma HLS UNROLL factor=4
        C[i] = A[i] + B[i];
    }
}
```
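To see what a partial unroll by 4 asks the tool to do, it can help to write the transformed loop by hand. The sketch below is only a conceptual equivalent, assuming a hypothetical `N` that is a multiple of 4 and a hypothetical `din_t`; when the trip count is not a multiple of the factor, the tool adds exit checks inside the unrolled body.

```cpp
#include <cstddef>

constexpr std::size_t N = 16;  // hypothetical size, a multiple of 4
typedef int din_t;             // hypothetical element type

// Hand-written equivalent of the loop above with UNROLL factor=4:
// one loop iteration now contains four independent additions.
void example_unrolled(const din_t A[N], const din_t B[N], din_t C[N]) {
    for (std::size_t i = 0; i < N; i += 4) {
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        C[i + 3] = A[i + 3] + B[i + 3];
    }
}
```

The four additions can only run in the same clock cycle if the arrays provide enough memory ports, which is why UNROLL is often combined with array partitioning.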



## Assignment-6



- Use the example in slide 3 of lecture 17 and reduce its resource utilization, especially the DSP usage (<u>https://github.com/varuns23/TAC-HEP-FPGA/tree/main/tutorial/wk9lec17/ex-func</u>)
  - You can use any combination of the following pragmas:
    - Array Partition
    - Array reshape
    - Pipeline
    - Dataflow
    - Latency
    - Allocation
    - Inline
  - Objective: DSP usage less than 10
- Refer to the ex-all folder for an example with pragmas

## Reminder: Assignments

- Assignment-1 (13-02-2025)
- Assignment-2 (18-02-2025)
- Assignment-3 (27-02-2025)
- Assignment-4 (18-03-2025)
- Assignment-5 (18-03-2025)

Uploaded to cernbox: https://cernbox.cern.ch/s/gmUqRDHTxDLqx4M

Send via email: varun.sharma@cern.ch

Submit in 2 weeks from date of assignment







Acknowledgements:

- <u>https://docs.amd.com/r/2024.1-English/ug1399-vitis-hls</u>
- ug871-vivado-high-level-synthesis-tutorial.pdf

## List of Available Pragmas

| Type                 | Attributes |
|----------------------|------------|
| Kernel Optimization  | <ul><li>pragma HLS aggregate</li><li>pragma HLS alias</li><li>pragma HLS disaggregate</li><li>pragma HLS expression_balance</li><li>pragma HLS latency</li><li>pragma HLS performance</li><li>pragma HLS protocol</li><li>pragma HLS reset</li><li>pragma HLS top</li><li>pragma HLS stable</li></ul> |
| Function Inlining    | <ul><li>pragma HLS inline</li></ul> |
| Interface Synthesis  | <ul><li>pragma HLS interface</li><li>pragma HLS stream</li></ul> |
| Task-level Pipeline  | <ul><li>pragma HLS dataflow</li><li>pragma HLS stream</li></ul> |
| Pipeline             | <ul><li>pragma HLS pipeline</li><li>pragma HLS occurrence</li></ul> |
| Loop Unrolling       | <ul><li>pragma HLS unroll</li><li>pragma HLS dependence</li></ul> |
| Loop Optimization    | <ul><li>pragma HLS loop_flatten</li><li>pragma HLS loop_merge</li><li>pragma HLS loop_tripcount</li></ul> |
| Array Optimization   | <ul><li>pragma HLS array_partition</li><li>pragma HLS array_reshape</li></ul> |
| Structure Packing    | <ul><li>pragma HLS aggregate</li><li>pragma HLS dataflow</li></ul> |
| Resource Utilization | <ul><li>pragma HLS allocation</li><li>pragma HLS bind_op</li><li>pragma HLS bind_storage</li><li>pragma HLS function_instantiate</li></ul> |

## Reminder: HLS Setup

- ssh <username>@cmstrigger02-via-login -L5901:localhost:5901
  - Or whichever port matches your display number (display :1 corresponds to port 5901)
  - Sometimes you may need to run vncserver -localhost -geometry 1024x768 again to start a new VNC server
- Connect to the VNC server with a remote desktop client
- Open terminal
  - source /opt/Xilinx/Vivado/2020.1/settings64.sh
  - cd /scratch/`whoami`
  - vivado\_hls

#### OR

- source /opt/Xilinx/Vitis/2020.1/settings64.sh
- cd /scratch/`whoami`
- vitis\_hls



April 10, 2025


## Jargons



- **IC:** Integrated Circuit, an assembly of hundreds of millions of transistors on a small chip
- **PCB:** Printed Circuit Board
- **LUT:** Look-Up Table, aka 'logic'. Generic functions on small-bitwidth inputs; combine many to build the algorithm
- **FF:** Flip-Flops, control the flow of data with the clock pulse. Used to build the pipeline and achieve high throughput
- **DSP:** Digital Signal Processor, performs multiplication and other arithmetic in the FPGA
- **BRAM:** Block RAM, a hardened RAM resource. More efficient than building memories from LUTs for more than a few elements
- **PCIe or PCI-E:** Peripheral Component Interconnect Express, a serial expansion bus standard for connecting a computer to one or more peripheral devices
- **InfiniBand:** a computer networking communications standard used in high-performance computing that features very high throughput and very low latency
- **HLS:** High Level Synthesis, a compiler for C, C++, SystemC into FPGA IP cores
- **HDL:** Hardware Description Language, a low-level language for describing circuits
- **RTL:** Register Transfer Level, the very low-level description of the function and connection of logic gates
- **FIFO:** First In First Out memory
- **Latency:** time between starting processing and receiving the result
  - Measured in clock cycles or seconds
- **II:** Initiation Interval, the time from accepting the first input to accepting the next input




