#### ML Acceleration with Heterogeneous Computing for Big Data Physics

Philip Harris (MIT)

#### **Fermilab**

Burt Holzman Sergo Jindariani Benjamin Kreis Mia Liu Kevin Pedro Nhan Tran Aristeidis Tsaris

Naif Tarafdar Paul Chow

Massachusetts Institute of Technology

Jeff Krupa Dylan Rankin Sang Eon Park

WASHINGTON

Scott Hauck Shih-Chieh Hsu Matthew Trahms Dustin Werran

UNIVERSITY OF ILLINOIS AT CHICAGO

Zhenbin Wu


Mark Neubauer Markus Crisziani Microsoft

Suffian Khan Brandon Perez Colin Versteeg Ted W. Way Andrew Putnam Kalin Ovatcharov

Vladimir Loncar Jennifer Ngadiuba Maurizio Pierini

COLUMBIA UNIVERSITY

Giuseppe Di Guglielmo

#### Beyond the Multicore Era

- Following the breakdown of Dennard scaling
  - Companies focused on multicore development
    - With many cores power limitations come up again





Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten Dotted line extrapolations by C. Moore

#### What becomes of the Post Multicore Era Processors become specialized

https://www.cc.gatech.edu/~hadi/doc/paper/2011-isca-dark\_silicon.pdf

#### **Understanding Time Scales**








### How do we process data?



- Conventional computing: industry is building a lot of tools, and our input can drive innovation
- Custom ASIC+FPGA system: ML needs to be done in <1 µs, which requires a rethink of ML processing



## Hidden gems?

There is a plethora of physics that we throw out



Higgs boson right on the cusp of being thrown out

### The dream

- At the moment:
  - We only keep the full data for one in 100,000 collisions
  - There is interesting physics that we have to throw away

- We would like to analyze every collision at the LHC
  - To deal with this we need to increase our throughput
  - Ultimately this means going to 100s of Tb/s

# The Challenge

- We are upgrading the system
  - Our event size will be 10 times larger

End of Dennard Scaling is about to hit us hard

- And we have to take data at 5x the rate
  - Need this just to preserve our existing physics
- 10s of years of processing without modifying system





## Deep Learning in HEP





With the rise of deep learning, we are quickly coming up with new ways to interpret the data and improve our physics data analysis

### Deep Learning L1 Trigger

- Have at MOST 1 µs to run an algorithm
  - We aim for algorithms that are in the 100 ns range
- Want to make the fastest possible algorithm
- Want to have the smallest initiation interval
  - A collision occurs every 25 ns (40 MHz)
  - We apply algorithms to multiple subsets of the total event
    - That means each application must accept new input every X < 25 ns
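The timing numbers above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted on the slide (40 MHz, 1 µs budget, 100 ns target); the function names are illustrative and not part of any trigger software.

```python
# Timing budget for the L1 trigger, using only the numbers quoted above.
COLLISION_RATE_HZ = 40e6     # 40 MHz collision rate
LATENCY_BUDGET_NS = 1000.0   # at most 1 µs per algorithm
TARGET_LATENCY_NS = 100.0    # algorithms aimed at the 100 ns range


def collision_period_ns(rate_hz: float) -> float:
    """Time between consecutive collisions, in nanoseconds."""
    return 1e9 / rate_hz


def events_in_flight(latency_ns: float, period_ns: float) -> float:
    """Collisions that arrive while one is still being processed."""
    return latency_ns / period_ns


period = collision_period_ns(COLLISION_RATE_HZ)
print(f"period = {period} ns")
print(f"in flight at 1 us budget: {events_in_flight(LATENCY_BUDGET_NS, period)}")
print(f"in flight at 100 ns target: {events_in_flight(TARGET_LATENCY_NS, period)}")
```

Because several collisions are in flight at once, the algorithm has to be pipelined: its initiation interval, not its latency, must stay below the 25 ns period.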



### Summing Up the Data flow



PyTorch

Support in HLS4ML for: MLPs, CNNs, binary/ternary NNs, BDTs, graph NNs, LSTM/GRUs

HLS tuning → final product



## What can we run?



75 ns latency, new input every 5 ns, fits in a conventional FPGA (VU9P)

# Design at the L1

Focus on 3 ways to cut down resources

Is our algorithm overly complex?

#### **Algorithmic Compression**



Are we too precise?

#### Quantization

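ap\_fixed<width,integer>

Example: 0101.1011101010, where the bits before the point are the integer part, the bits after it are the fractional part, and width is the total number of bits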
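To make the `<width,integer>` notation concrete, here is a small sketch of fixed-point quantization in plain Python. It assumes a signed value, round-to-nearest, and saturation at the range limits; those choices are assumptions made for illustration and are not necessarily the rounding and overflow modes an HLS tool applies by default. The function name `quantize_fixed` is hypothetical.

```python
import math


def quantize_fixed(x: float, width: int, integer: int) -> float:
    """Snap x to a signed fixed-point grid.

    width   : total number of bits
    integer : bits above the binary point (sign bit included)
    Rounding to nearest and saturation are assumptions of this sketch.
    """
    frac_bits = width - integer
    scale = 2 ** frac_bits
    lowest = -(2 ** (integer - 1))
    highest = 2 ** (integer - 1) - 1 / scale
    q = math.floor(x * scale + 0.5) / scale
    return min(max(q, lowest), highest)


# The bit pattern 0101.1011101010 has 4 integer and 10 fractional bits.
print(quantize_fixed(5.728515625, 14, 4))  # already on the grid, returned unchanged
print(quantize_fixed(0.3, 16, 6))          # nearest multiple of 2**-10
print(quantize_fixed(100.0, 16, 6))        # saturates just below 32
```

Fewer fractional bits mean a coarser grid and cheaper arithmetic, which is the trade-off the "Are we too precise?" question points at.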

Does it really need to be this fast?

#### **Reuse Factor**



### Current Status

- Tool: quickly adopted for a number of applications
  - Muon p<sub>T</sub> reconstruction with an NN
  - Autoencoder at Level 1
  - NN based Tau lepton identification
  - Jet substructure at L1
- Tutorials exist using the AWS F1 FPGA cluster

Targeting LHC Run 3





## Deep Learning in HLT

- Previous systems have been CPU only
  - New systems will likely be heterogeneous (FPGAs/GPUs...)
  - Some parts of reco can still benefit from use of CPU
- With timescales at the level of milliseconds
  - Utilize industry tools to use heterogeneous systems
  - Can consider GPUs for parallelized processing
- ML on the millisecond timescale used by many industries
  - ML inference engines: ML-as-a-service (MLaaS)



Complicated scheme of modules (diagram labels: Event Setup, Database, MODULE 4, MODULE 5, ML INFER 2)

- While some parts are parallelizable
- Collision-level analysis is built in by construction (Batch 1)

## Service Options

#### Low latency Triggering

Larger latency but still large throughput (future slides)





When latency is not a critical element, we can go off-site to the cloud

For timescales <100 ms, this is not an option



# Offloading to Hardware

• To run these algorithms within our software



- Our Strategy
  - Pick benchmark ML examples+put them on FPGAs/GPUs
  - Observe what level of speedup we get over CPUs, and how



### Examples

#### Hcal Reco



Energy reconstruction of hadronic showers: simple energy regression, 16000 times per collision (batch N, per particle)

Top quark identification (here we use Resnet50 as benchmark): complicated identification with many inputs, 1-2 times per collision

**Batch 1 per Event** 



### Performance

#### Hcal Reco

| Algo               | Per Event |
|--------------------|-----------|
| Old CPU            | 50ms      |
| NN CPU             | 15ms      |
| NN GPU(1080 Ti)    | 3ms       |
| NN FPGA            | 2ms       |
| +Off Machine       | +10ms     |
| +Offsite           | +40ms     |
| throughput+offsite | 3ms       |

#### Image Top tagger

| Algo               | Per Event |  |
|--------------------|-----------|--|
| CPU                | 1.75s     |  |
| GPU Batch 1        | 7ms       |  |
| GPU Batch 32       | 2ms       |  |
| FPGA               | 1.7ms     |  |
| +Off Machine       | +10ms     |  |
| +Offsite           | +40ms     |  |
| throughput+offsite | 1.7ms     |  |

Hcal Reco: GPU uses TensorFlow+TensorRT; FPGA uses HLS4ML+SDAccel. **Batch 16000 per particles**

Image Top tagger: GPU uses TensorFlow+TensorRT; **FPGA: Microsoft Azure**. **Batch 1 per Event**

No GPU or FPGA code is needed to implement these optimally!

## **Processing Technology**



With a GPU we can get to FPGA level of throughput, but long latency

Speedups on a single FPGA can serve many different CPU cores

### What have we learned?

• With large speedups we can redesign our system



Process event by event



# Alternative GPU model



#### Full Reconstruction algorithm ported to GPU



## Beyond the LHC

https://fastmachinelearning.org/

- This talk has focused on data reconstruction at the LHC
- Are quickly identifying other cases with the same issues
- Have extended our collaboration to incorporate everybody
  - Inaugural workshop can be found here: https://indico.cern.ch/event/822126/
  - You too can join our Fast Machine Learning effort

#### Let's consider a few examples

### Neutrino Event Reconstruction



Reconstruction can be performed with a CNN (Resnet-like)

Future detectors will have to deal with 40 Tb/s of data

They will aim for per-event latency < 2ms to find Supernovae

# Beam Dump Exps.




- Experiments with very large data rates benefit from real-time processing
  - A wide variety of beam dump experiments would benefit
  - ML is a great way to preprocess and compress data
  - Allows for fast high rate processing

### Particle Accelerators

- Demands for high speed control of accelerator systems
- Large data rates to monitor and control beam dynamics
- Have had continual success with ML solutions for modeling





### **Gravitational Wave Detection**



Fast identification of gravitational waveforms, to signal satellites and other telescopes about astronomical phenomena (multi-messenger astronomy)

| Survey      | Galaxy lenses | Quasar lenses | Supernova lenses |
|-------------|---------------|---------------|------------------|
| Today (all) | 1000          | <50           | 2                |
| DES         | 2,000         | 120           | 5                |
| LSST        | 120,000       | 8,000         | 120              |
| Euclid      | 170,000       |               |                  |

Nord+2016; Collett+2015; Gavazzi+2008; Oguri+Marshall, 2010

LSST will produce over **10 million** transient alerts per night.

|           | SDSS I-II (2000-08) | DES (2013-18)     | LSST (2022-32)     |
|-----------|---------------------|-------------------|--------------------|
| Mirror    | 2.5-meter           | 4-meter           | 8.4-meter          |
| Galaxies  | O(10<sup>8</sup>)   | O(10<sup>8</sup>) | O(10<sup>10</sup>) |
| Sky area  | 10k sq. deg.        | 5k sq. deg.       | 20k sq. deg.       |
| Data rate | 0.2 TB/night        | 1 TB/night        | 20 TB/night        |







Identification of transients requires real-time processing of all data

## Astrophysics

With LSST in 2022, astrophysics datasets reach petabyte scales with large and complicated feature analysis





## Many More

## Everything Getting larger



### Conclusions

- Large scale campaign underway to adopt deep learning everywhere
- Scale of data processing in physics is getting larger
  - With large datasets come huge scientific potential
- Have demonstrated ML+ Heterogeneous computing works
  - Parallelization of NNs and eff of FPGAs give large speedups
    - Can be used both for very low latency systems
    - GPUs are a viable alternative for longer latency

#### Thanks!






#### **Reconstruction Challenge**



LHC reconstruction involves combining many different detectors into particles



Currently we have 70 collisions lying on top of each other in each **event**

In the future there will be > 200 collisions

#### **Batch N per particles**

#### **Batch 1 per Event**

#### Data Flow in CMS



High speed Low granularity readout (10µs)

Intermediate speed (100ms) better readout



Full data readout (10s)

Despite the large rate reduction we still store many Petabytes of data

#### Current trends in HEP

- Rapid adoption to improve reconstruction quality
- Effective for newer detectors with large numbers of channels
- Large dedicated effort within HEP community





## Quantization

#### FPGAs support arbitrary precision

[Figure: AUC / Expected AUC versus fixed-point precision, from <8,6> to <33,6>, for the g, q, w, z, and t taggers (hls4ml), full and pruned]

Fixed-point precision is written <total bit width, integer bits above decimal>, e.g. 0101.1011101010 split into integer and fractional parts

#### Alternative GPU model



Small event size(compared to CMS/ATLAS)










#### **Overall Performance**

- Set the reuse factor to 1 (max speed) and the precision to <18,8>
  - Benchmarked this on a Virtex 7 (xc7vx690tffg1927-2)



| Reuse = 1 | BRAM | DSP  | FF  | LUT |
|-----------|------|------|-----|-----|
| Total     | 13   | 1116 | 47k | 35k |
| % Usage   | ~0%  | 20%  | 3%  | 5%  |

#### Algorithm Compression

- Compression is critical for reducing the resources ML needs
  - Allows us to put much larger networks on an FPGA
- Two key elements which only FPGAs can do
  - Pruning of the weights (removes multiplications)



$$L_{i} = \underbrace{-\log\left(\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}\right)}_{\text{usual cross entropy}} + \underbrace{\lambda_{1} | w |}_{\text{network compression}}$$

- Add a regularization term to the loss function
  - Helps to force weights to smaller values
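The loss above, followed by pruning, can be sketched in plain Python. This is a minimal illustration for a single example rather than the training code used for these results; the pruning threshold and the function names are assumptions introduced here.

```python
import math


def l1_regularized_loss(logits, label, weights, lam):
    """Softmax cross entropy for one example plus lam * sum(|w|)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(f - m) for f in logits))
    cross_entropy = log_sum_exp - logits[label]
    l1_penalty = lam * sum(abs(w) for w in weights)
    return cross_entropy + l1_penalty


def prune(weights, threshold):
    """Zero out weights with magnitude below threshold (hypothetical cut)."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


loss = l1_regularized_loss([0.0, 0.0], 0, [1.0, -2.0], lam=0.1)
print(loss)  # log(2) from the cross entropy plus 0.3 from the penalty
print(prune([0.5, -0.01, 0.2, 0.001], threshold=0.05))
```

The penalty pushes many weights toward zero during training; each weight that is then pruned is one multiplication the FPGA no longer has to implement.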








#### Reuse Factor

By adjusting how often the pipeline reuses each multiplier (the reuse factor), we can trade speed against resource usage
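A rough way to picture the trade-off is to count multipliers for a dense layer. The sketch below is a simplified model under stated assumptions: every multiplication in the layer is mapped to a hardware multiplier, and a reuse factor of R lets each multiplier handle R multiplications in sequence. It is not a description of how hls4ml schedules a design, and the layer sizes are made up.

```python
import math


def multipliers_needed(n_in: int, n_out: int, reuse_factor: int) -> int:
    """Multipliers for an n_in x n_out dense layer when each one is
    reused reuse_factor times (simplified model, see assumptions above)."""
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be at least 1")
    return math.ceil(n_in * n_out / reuse_factor)


for reuse in (1, 2, 4, 8):
    print(reuse, multipliers_needed(16, 64, reuse))
```

In this model a reuse factor of 1 is the fully parallel, fastest design, and each doubling of the reuse factor halves the multiplier count at the cost of more clock cycles per inference.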





## Takeaways

- LHC has a unique role to play when processing data
  - With the insanely large data rates
  - Low latency+high throughput demands specialized system
    - Our system will always be ASIC+FPGA-only
    - Working to bring ML and complex algorithms to the system
- As part of this work we developed HLS4ML
  - Quickly becoming a staple for L1 trigger development

#### (500ms) High Level Trigger

100 kHz of collisions in



- 1kHz of collisions out
- <500ms to analyze collision
- Currently
  - A local computing cluster
  - System is all CPUs
- Experiments are considering GPU/CPU system for 2022

## How Fast is It?



With an FPGA we can get 1.7 ms inference time at batch 1

With a GPU we can get 2 ms/img at batch 70



#### GPU as a service

- Using tensor-rt-server
  - Industry standard
- Latency : 16ms

| Algo            | Per Event    | +On-site aaS                         |
|-----------------|--------------|--------------------------------------|
| Old             | 50ms         | N/A                                  |
| NN CPU          | 15ms         | N/A                                  |
| NN GPU(1080 Ti) | 3ms (prelim) | 16ms (8ms/event w/concurrent calls)  |
| NN FPGA         | 2ms          | TBD(<10ms)                           |

# Benchmark #1

- FPGA as a service
  - Numbers TBD (<10ms)
    - Using Galapagos (Naif Tarafdar + Paul Chow)
      - Heterogeneous middleware





# Benchmark #2

- Three options considered, all from a computer in the same cluster

|         | GPU as a service                 | Azure Cluster               | Microsoft Databox Edge               |
|---------|----------------------------------|-----------------------------|--------------------------------------|
| Path    | From local CPU to GPU service    | From local CPU to Brainwave | From local CPU to FPGA system at FNAL |
| Latency | Batch 1: 23ms<br>Batch 32: 230ms | Batch 1: 15ms               | Batch 1: 20ms                        |

| Algo         | Per Event | +On-site aaS |
|--------------|-----------|--------------|
| CPU          | 1.75s     | N/A          |
| GPU Batch 1  | 7ms       | 23ms         |
| GPU Batch 32 | 2ms       | 230ms        |
| FPGA         | 1.7ms     | 15ms         |





# Throughput

- Despite the longer latency we can have one node serve many



- With this setup, how many nodes can be served before the system has to throttle down?
- Bottlenecks can come from the network, not just the service
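The "one node serves many" question can be estimated with a simple capacity model. The sketch below assumes the service handles one request at a time in 1.7 ms (the FPGA per-event figure from the tables) and that each client sends one request every 50 ms; the 50 ms period is a hypothetical number chosen only for illustration, and network limits are ignored.

```python
def max_clients(service_time_us: int, client_period_us: int) -> int:
    """Clients one service can sustain if each sends one request per
    client_period_us and each request occupies the service for
    service_time_us (no batching, no network limits)."""
    if service_time_us <= 0:
        raise ValueError("service_time_us must be positive")
    return client_period_us // service_time_us


# 1.7 ms per inference on the service, one request per client every 50 ms
print(max_clients(1700, 50_000))
```

Beyond that client count requests start to queue and the system throttles; in practice the network can impose a lower limit than the accelerator itself.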



## Benchmark #1

- Throughput is driven by the actual minimum latency of algo
  - For the FPGA, algo latency is 0.08 ms → working to get there
- Cloud has to deal with additional slowdown from networking



| Algo            | Per Event    | +On-site aaS | +Cloud aaS | Ping | On-site/Cloud throughput |
|-----------------|--------------|--------------|------------|------|--------------|
| Old             | 50ms         | N/A          | N/A        | N/A  | N/A          |
| NN CPU          | 15ms         | N/A          | N/A        | N/A  | N/A          |
| NN GPU(1080 Ti) | 3ms (prelim) | 16ms         | 90ms       | 75ms | 1ms/30ms*    |
| NN FPGA         | 2ms          | TBD(<16ms)   | TBD        | TBD  | >0.1ms       |

\*Cloud throughput on GPU still to be scrutinized

# Benchmark #2



| Algo         | Per Event | +On-site aaS | +Cloud aaS | Ping | On-site/Cloud* throughput |
|--------------|-----------|--------------|------------|------|---------------|
| CPU          | 1.75s     | N/A          | N/A        | N/A  | N/A           |
| GPU Batch 1  | 7ms       | 23ms         | 97ms       | 75ms | 5ms/20ms*     |
| GPU Batch 32 | 3ms       | 240ms        | 975ms      | 75ms | 8ms/20ms*     |
| FPGA         | 1.7ms     | 15ms         | 60ms       | 25ms | 1.7 ms        |

\*Cloud throughput on GPU still to be scrutinized



#### Idea #2: Services

• To run these algorithms within our software



- SONIC : Services for Optimized Network Inference on Coprocessors
- Strategy
  - Use the same benchmarks as before

- Now wrap these with gRPC protocol between different machines



#### Services Takeaway

- Observe a ~10ms increase in latency when going to a service
  - Have observed large variations across network
  - Maintaining consistent network connection critical for running





# Throughput vs Latency

- Why are we limited to 500ms in latency?
  - 500 ms at 100 kHz is 400 GB of data → not that much
  - With some redesign it is possible to increase this limit
    - Just need more disk as a buffer
- We still need to be able to process this data quickly
  - That means we need to ensure throughput is high

#### (10s) LHC Computing Grid

Running jobs: 244151 (snapshot from 1/22/2013, 5:55:18 p.m.)

Transfer rate: 40.08 GiB/sec



### Offline Reco

- At the final tier of reconstruction
  - Worldwide grid is roughly 0.75 million cores and 600 PB of data
  - Latency is not a critical limitation
    - Grid will have different technology all over (common protocol?)



# Service Options

#### Low latency Triggering (previous slides)

Larger latency but still large throughput (future slides)



When latency is not a critical element, we can go off-site to the cloud

At the offline tier we can switch to the cloud now → heterogeneity


## Service in Cloud



#### We have already done this with CPUs in the cloud

# Going Beyond

- To be really effective aim for flexibility in NN design
  - Have many different NN architectures to solve many different problems
  - Adapting to industry models (Resnet50/Bert/...) is not a good option
- Multi-FPGA/.... support
  - Adapting to FPGAs/... will want to avoid CPU altogether
  - Can take advantage of inherent speedups and networking on FPGA
- Throughput adaptations in our computing model
  - Latency limits not critical: can consider alternative computing models





# Benchmark #1



#### **Benchmark #2**





\*Also investigating Xilinx ML suite(see backup) + Intel Open Vino

## Microsoft Brainwave



- Full FPGA interconnected fabric setup-as-a-service
  - Capable of running many different NN architectures
  - Relying on the NPU framework for ML compilation
  - (Very) optimized use of ML on the FPGA



# Benchmark #1





## How Fast is it?

- Unroll network on the FPGA with hls4ml+SDAccel
- Actual network runs in 70ns on an FPGA with II of 5ns
  - For 16000 channels this equates to 80µs total
  - Transfer back and forth on PCIe is 700µs each way
- Current non-ML-based algorithm takes 50ms

| Algo            | Per Event    |
|-----------------|--------------|
| Old             | 50ms         |
| NN CPU          | 15ms         |
| NN GPU(1080 Ti) | 3ms (prelim) |
| NN FPGA         | 2ms          |
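The numbers in the bullets above can be combined into a rough time budget for the FPGA path. This is a back-of-the-envelope sketch: it assumes the 16000 channels stream through one pipelined network instance (first result after the 70 ns latency, then one result per 5 ns initiation interval) and that the two PCIe transfers do not overlap with the computation.

```python
def pipelined_time_ns(n_items: int, latency_ns: float, ii_ns: float) -> float:
    """First result after latency_ns, then one more every ii_ns."""
    return latency_ns + (n_items - 1) * ii_ns


N_CHANNELS = 16000
PCIE_ONE_WAY_US = 700.0

compute_us = pipelined_time_ns(N_CHANNELS, latency_ns=70.0, ii_ns=5.0) / 1000.0
total_ms = (compute_us + 2 * PCIE_ONE_WAY_US) / 1000.0

print(f"compute: {compute_us:.1f} us")      # about 80 us
print(f"with PCIe both ways: {total_ms:.2f} ms")  # about 1.5 ms
```

The estimate lands close to the 2 ms per event quoted in the table and shows that the transfer over PCIe, not the network itself, dominates.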

Significant speedups



## Benchmark #2

Resnet50 on Azure FPGA cluster with <2ms/inference

A standard ML benchmark: Top Tagging (resnet50 for physicists)

Many different Top Tagging attempts

| Approach | AUC | Acc. | 1/ε<sub>B</sub> (@ ε<sub>S</sub>=0.3) | Contact | Comments |
|---|---|---|---|---|---|
| LoLa | 0.980 | 0.928 | 680 | GK / Simon Leiss | Preliminary number, based on LoLa |
| LBN | 0.981 | 0.931 | 863 | Marcel Rieger | Preliminary number |
| CNN | 0.981 | 0.93 | 780 | David Shih | Model from Pulling Out All the Tops with Computer Vision and Deep Learning (1803.00107) |
| P-CNN (1D CNN) | 0.980 | 0.930 | 782 | Huilin Qu, Loukas Gouskos | Preliminary, use kinematic info only (https://indico.physics.lbl.gov/indico/event/546/contributions/270/) |
| 6-body N-subjettiness (+mass and pT) NN | 0.979 | 0.922 | 856 | Karl Nordstrom | Based on 1807.04769 (Reports of My Demise Are Greatly Exaggerated: N-subjettiness Taggers Take On Jet Images) |
| 8-body N-subjettiness (+mass and pT) NN | 0.980 | 0.928 | 795 | Karl Nordstrom | Based on 1807.04769 (Reports of My Demise Are Greatly Exaggerated: N-subjettiness Taggers Take On Jet Images) |
| Linear EFPs | 0.980 | 0.932 | 380 | Patrick Komiske, Eric Metodiev | d <= 7, chi <= 3 EFPs with FLD. Based on 1712.07124: Energy Flow Polynomials: A complete linear basis for jet substructure. |
| Particle Flow Network (PFN) | 0.982 | 0.932 | 888 | Patrick Komiske, Eric Metodiev | Median over ten trainings. Based on Table 5 in 1810.05165: Energy Flow Networks: Deep Sets for Particle Jets. |
| Energy Flow Network (EFN) | 0.979 | 0.927 | 619 | Patrick Komiske, Eric Metodiev | Median over ten trainings. Based on Table 5 in 1810.05165: Energy Flow Networks: Deep Sets for Particle Jets. |
| 2D CNN [ResNeXt50] |  | 0.000 |  | Huilin Qu, Loukas Gouskos | Preliminary from indico.cern.ch/event/745718/contributions/3202526 |
| DGCNN | 0.984 | 0.937 | 1160 | Huilin Qu, Loukas Gouskos | Preliminary from indico.cern.ch/event/745718/contributions/3202526 |
World's best tagger: AUC = 98.4%, acc. = 93.7%, 1/ε<sub>B</sub> = 1160

Our tagger: AUC = 98.3%, acc. = 93.5%, 1/ε<sub>B</sub> = 1000

This tagger is state of the art

| Model           | AUC   | Acc.  | 1/ε<sub>B</sub> |
|-----------------|-------|-------|-----------------|
| Floating point  | 98.0% | 90.1% | 671             |
| Quant.          | 97.5% | 84.1% | 415             |
| Quant., f.t.    | 98.2% | 93.0% | 971             |
| Brainwave       | 98.2% | 92.6% | 935             |
| Brainwave, f.t. | 98.3% | 93.5% | 1000            |

[Figure: background efficiency versus signal efficiency for the models above; retraining w/Brainwave fixed precision]
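The 1/ε<sub>B</sub> figure of merit quoted above is the inverse background efficiency at a fixed signal efficiency (ε<sub>S</sub> = 0.3 in the comparison table). A minimal sketch of how it can be computed from classifier scores follows; the scores are made-up toy values and the function name is hypothetical.

```python
def background_rejection(signal_scores, background_scores, signal_eff=0.3):
    """Return 1/eps_B at the score threshold that keeps signal_eff of the signal."""
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(signal_eff * len(ranked)))
    threshold = ranked[n_keep - 1]  # lowest score among the kept signal
    n_bkg_pass = sum(1 for s in background_scores if s >= threshold)
    if n_bkg_pass == 0:
        return float("inf")
    return len(background_scores) / n_bkg_pass


# Toy example: 10 signal jets and 100 background jets.
signal = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
background = [0.9] + [0.1] * 99
print(background_rejection(signal, background))
```

A larger value means fewer background jets survive the cut that keeps 30% of the signal, so quantized or retrained networks can be compared on equal footing.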



# Accelerators Takeaway

- FPGAs and GPUs both work; FPGAs are better (low batch) or as good
- Benchmark #1
  - Latency lowest on FPGA despite a large batch process
  - Limited by I/O considerations with PCIe
- Benchmark #2
  - FPGA dominates at batch 1
  - With large throughput GPU can start to compete





# Future Strategies

Incorporating heterogeneous systems (GPU/FPGA)

|                                                  | Idea #0 Port<br>Existing Algos                              | Idea #3 Upgrade to<br>ML Algos                                 |
|--------------------------------------------------|-------------------------------------------------------------|----------------------------------------------------------------|
| Idea #1<br>Investigate<br>onboard<br>GPU/FPGA    | Rewrite all of<br>our code in<br>CUDA/Kokkos<br>HLS/RTL/??? | Tools exist:<br>TF/Pytorch/TRT<br>Xilinx ML Suite<br>Brainwave |
| Idea #2<br>Outsource<br>GPU/FPGA<br>to a service | Write specialized interface                                 | Tools exist:<br>TRT-server<br>Brainwave<br>and in cloud!       |

Focus of this talk: Idea #3, upgrading to ML algos (see backup for the others)

ML is highly parallelizable → big speedups

### Takeaways

- When large speedups are present in overall throughput
  - Where as-a-service starts to really shine
  - Can think about one service for many machines
  - Will take a latency hit in our system from this
    - This is something we can deal with
- Our next step is bringing the studies to scale
  - Can we serve many thousands of processes at once?

### What have we learned?

• With large speedups we can redesign our system



Process event by event



Process (event by event)? outsource to aaS



- A new event every 25ns
- Interconnected FPGAs
  - Optical links between the chips
  - 48-112 links per chip
  - Links run at 10-25 Gbps
  - Full system is O(1000) FPGAs

# L1 Trigger



- We have at MOST 1µs to run an algorithm
  - We aim for algorithms that are in the 100ns range
- Want to make the fastest possible algorithm
- Want to have the smallest initiation interval
  - We apply algorithms to multiple subsets of total event





- Ping time is 75 ms (the speed-of-light time for the Google Maps distance is 32 ms)
- To UCSD and back takes ping time + 16ms
- Still working on test with FPGA (soon)

| Algo            | Per Event    | +On-site aaS | +Cloud aaS | Ping |
|-----------------|--------------|--------------|------------|------|
| Old             | 50ms         | N/A          | N/A        | N/A  |
| NN CPU          | 15ms         | N/A          | N/A        | N/A  |
| NN GPU(1080 Ti) | 3ms (prelim) | 16ms         | 90ms       | 75ms |
| NN FPGA         | 2ms          | TBD(<16ms)   | TBD        | TBD  |



ML at all tiers will help to recover missing physics



### Benchmark #2



| Algo         | Per Event | +On-site aaS | +Cloud aaS | Ping |
|--------------|-----------|--------------|------------|------|
| CPU          | 1.75s     | N/A          | N/A        | N/A  |
| GPU Batch 1  | 7ms       | 23ms         | 97ms       | 75ms |
| GPU Batch 32 | 2ms       | 240ms        | 975ms      | 75ms |
| FPGA         | 1.7ms     | 15ms         | 60ms       | 25ms |

### In the detector



All reconstruction is separated on an event by event level

A single particle can leave deposit in many detectors

- Each detector deposit has a complex and different topology
- Reconstruction of particles/detectors can be parallelized

### **Reconstruction of Objects**



#### Batch 1 Per Event

All reconstruction is separated on an event by event level

A single particle can leave deposit in many detectors

• Each detector deposit has a complex and different topology

#### **Batch N Per Particle**

• Reconstruction of particles/detectors can be parallelized

### Xilinx ML Suite

• Consider Googlenet example

**Throughput Rate** 

- requests avg latency: 16.855598 ms
- time avg latency: 2.07637 ms

[Figure: images/s versus # streams, comparing input rate with max FPGA throughput]





### Alternative GPU Model



Full Reconstruction algorithm ported to GPU




### Another View of Same

Collision rate is 40 MHz

A new collision every 25 ns



Latency : 10µs


### 40 MHz (10µs)

### Systems



Each block represents O(30) FPGAs with 50 Tb/s bandwidth and 1 µs latency



**Targeting Ultra low latency** applications

HLS tuning

**Final Product** 

### 40 MHz (10μs) Example Performance



### 3-Layer NN 75ns latency with an II of 1

Latency (in clocks) gets worse with reuse factor, consistent with sharing resources

Tunable reuse of DSPs and BRAM to get latency and II on ns timescales

### What is a collision?

- LHC collides 60 protons at the same time
  - Eventually will become 200 protons at the same time
  - Collisions occur at 40 MHz
  - Expect roughly 1000(2000) particles per collision now(future)
    - Particles can leave deposits in many detectors
  - Aim to reconstruct aggregate properties of these collisions
- LHC Detector is roughly 100 Million channels
  - After zero suppression we have 8MB per collision

### A More detailed View



## Data Box Edge: MS Databox Edge

A Microsoft *hardware-as-a-service solution* with an FPGA inside, installed at FNAL

```
iot_service = \
    IotWebservice.deploy_from_image(
        ws,
        iot_service_name,
        Image(ws, image_name),
        deploy_config,
        iothub_compute
        )
```



Deploy pre-trained NNs using a CLI or a Python SDK

Inference from a client by sending data over gRPC

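```
# PredictionClient is assumed to come from the Azure ML accelerated-models SDK;
# np_array (the input image batch) and module_name are defined elsewhere.
client = PredictionClient(
    address="address.fnal.gov",  # placeholder host name, quoted so it is a valid string
    port=50051,
    use_ssl=False,
    service_name=module_name,
)
# Send the input tensor over gRPC and receive the network output.
result = client.score_numpy_arrays(
    input_map={"Placeholder:0": np_array}
)
```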

## Jet Tagger Example

- Distinguish between top quarks and QCD using 224x224 single-color images
  - Images: collected energy in the η/φ plane (detector coordinates)



#### **Previous inference results**

- On a single CPU: ~500 ms
- On Azure Kubernetes Cloud Service: ~60-80 ms (depending on distance)
- Deployed at Azure Data Center in Virginia (2018): ~10 ms

#### **Using Data Box Edge**

- Docker container directly on DBE: 14 ms ±25
- From LPC: **20 ms** ±*30*
- From laptop at FNAL: **68 ms** ±27
- From LXPLUS @ CERN: **168 ms** ±62



### 40 MHz (10µs)

Have to take a new event every 25ns

Interconnected FPGAs with direct optical links between the chips: 48-112 links per chip, links run at 10-25 Gbps

Full system is O(1000) FPGAs

As FPGAs have gotten larger, so has the resolution of our detector



# L1 Trigger



### External Work in CMSSW (1)

Setup:

- TBB controls running modules
- Concurrent processing of multiple events
- Separate helper thread to control external work
- Can wait until enough work is buffered before running external process

|                                   |                            | -                          |
|-----------------------------------|----------------------------|----------------------------|
| External<br>Controlling<br>Thread |                            |                            |
| Running                           |                            |                            |
| Waiting<br>To Run                 | MODULE<br>A<br>MODULE<br>B | MODULE<br>B<br>MODULE<br>C |
|                                   | Event Loop<br>1            | Event Loop<br>2            |

### External Work in CMSSW (2)

Acquire:

- Module *acquire()* method called
- Pulls data from event
- Copies data to buffer
- Buffer includes callback to start next phase of module running



#### External Work in CMSSW (3)

Work starts:

- External process runs
- Data pulled from buffer
- Next waiting modules can run (concurrently)

| External<br>Controlling<br>Thread | 1               | 2               |
|-----------------------------------|-----------------|-----------------|
| Running                           | MODULE<br>B     | MODULE<br>B     |
| Waiting<br>To Run                 | MODULE          | MODULE          |
|                                   | Event Loop<br>1 | Event Loop<br>2 |

#### External Work in CMSSW (4)

Work finishes:

- Results copied to buffer
- Callback puts module back into queue



### External Work in CMSSW (5)

Produce:

- Module *produce()* method is called
- Pulls results from buffer
- Data used to create objects to put into event
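The acquire / external work / callback / produce sequence described on these slides can be mimicked in a few lines of Python. This is only a toy analogue built on the standard library thread pool: the class, the `fake_inference` stand-in for the external service, and the dictionary used as an "event" are all hypothetical, and none of it is the CMSSW API.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class ExternalWorkModule:
    """Toy module with separate acquire() and produce() phases."""

    def __init__(self, executor, external_fn):
        self.executor = executor
        self.external_fn = external_fn
        self.buffer = None

    def acquire(self, event, on_done):
        data = list(event["inputs"])  # pull data from the event, copy to a buffer
        future = self.executor.submit(self.external_fn, data)  # external work starts

        def _callback(fut):
            self.buffer = fut.result()  # results copied to the buffer
            on_done(self)               # callback puts the module back in the queue

        future.add_done_callback(_callback)

    def produce(self, event):
        event["outputs"] = self.buffer  # create objects to put into the event


def fake_inference(data):
    """Stand-in for the external accelerator call."""
    return [2 * x for x in data]


def run(events):
    ready = queue.Queue()
    with ThreadPoolExecutor(max_workers=2) as pool:
        for ev in events:
            module = ExternalWorkModule(pool, fake_inference)
            module.acquire(ev, lambda m, ev=ev: ready.put((m, ev)))
        for _ in events:  # other work could run here while results are pending
            module, ev = ready.get()
            module.produce(ev)
    return events


print(run([{"inputs": [1, 2]}, {"inputs": [3]}]))
```

The point of splitting the module in two is that the calling thread is free between `acquire()` and `produce()`, so other modules and other events can run while the external request is outstanding.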



### Sonic and Friends

