# Trigger, DAQ and **Machine Learning CPAD 2019**

Dec 10, 2019

**Verena Martinez Outschoorn and Isobel Ojalvo** 

### DAQ Concepts Edge ML

#### Talk by R. Herbst

Figure: workflow diagram with blocks Train & Test Data Sets → Caffe/TensorFlow train and test software → Layer Definition, Weight & Bias Values → CNN Config Record (VHDL) → Synthesis / Place & Route → **FPGA**.

Framework to provide a configurable VHDL based **inference engine** 

- Layer Types supported: Convolution, Pool & Full

Developed as a proof of concept but applicable for many HEP experiments

Developed for the Linac Coherent Light Source II



### DAQ Concepts Edge ML

#### Typical NN operations:



| Device                  | # of DSPs |
|-------------------------|-----------|
| Kintex-7 325T           | 840       |
| Virtex-7 690T           | 3600      |
| Kintex UltraScale KU115 | 5500      |
| Virtex UltraScale+ VU9P | 6800      |

#### + DNN

- Including support for large layers
- Binary and Ternary DNN
  - Low precision (1 or 2 bit) weights performance
  - Implemented in LUTs

- Basic DSP48E1 Slice functionality
  - Conv1D and Conv2D (small)
    - Large Convs and Binary/Ternary coming soon
  - Other features
    - Batch normalization
    - Various activation functions
    - Tools for comparing C and RTL simulation results
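The binary and ternary DNN support listed above relies on the fact that very low-precision weights need no multiplier. A minimal sketch, assuming weights restricted to {-1, 0, +1}: each multiply-accumulate reduces to an add, a subtract, or a skip, which is why such layers can map to LUTs instead of DSP slices. The function name `ternary_dot` and the sample values are illustrative assumptions, not part of the tools described in the talk.

```python
def ternary_dot(inputs, weights):
    """Dot product with ternary weights using only add, subtract, or skip."""
    acc = 0
    for x, w in zip(inputs, weights):
        if w not in (-1, 0, 1):
            raise ValueError("weights must be -1, 0 or +1")
        if w == 1:
            acc += x      # +1 weight: add the input
        elif w == -1:
            acc -= x      # -1 weight: subtract the input
        # 0 weight: the input is skipped entirely
    return acc


# Hypothetical inputs and weights, for illustration only.
inputs = [3, -2, 5, 7]
weights = [1, -1, 0, 1]

result = ternary_dot(inputs, weights)
# Reference computed with ordinary multiplications.
reference = sum(x * w for x, w in zip(inputs, weights))
print(result, reference)
```

Both values agree, showing that the multiplier-free loop is equivalent to the ordinary dot product for ternary weights.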

#### Talk by S. Jindariani


### Machine Learning based algorithm for reconstructing prompt and displaced muons at Level-1 in CMS detector



At CMS L1 muon transverse momentum assignment has used ML for inference since LHC Run-1

- Traditionally Used LUTs
- NN Inference also should be possible!
  - Trade off in FPGA resource usage

For Phase-2, the EMTF algorithms will evolve to incorporate new detectors, cope with pileup, maintain efficiency, and add displaced-muon ID

#### **HLS estimates**

Displaced EMTF++: NN performance

Utilization estimates (summary):

| Name            | BRAM_18K | DSP48E | FF      | LUT     | URAM |
|-----------------|----------|--------|---------|---------|------|
| DSP             | -        | -      | -       | -       | -    |
| Expression      | -        | -      | 0       | 6       | -    |
| FIFO            | -        | -      | -       | -       | -    |
| Instance        | 39       | 2420   | 69109   | 90580   | -    |
| Memory          | -        | -      | -       | -       | -    |
| Multiplexer     | -        | -      | -       | 1404    | -    |
| Register        | -        | -      | 4280    | 32      | -    |
| Total           | 39       | 2420   | 73389   | 92022   | 0    |
| Available       | 4320     | 6840   | 2364480 | 1182240 | 960  |
| Utilization (%) | ~0       | 35     | 3       | 7       | 0    |

#### Talk by J. F. Low



APd board being developed



Looking into the Phase-2 APd board <sup>[3]</sup> with Virtex US+ VU9P FPGA, which has 3X more LUT & FF, and 2X more DSP.

NN should comfortably fit in the VU9P (DSP usage is 35%)

32 clk @ 333 MHz ≈ 100 ns latency
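The latency and DSP figures quoted above follow from simple arithmetic, sketched here as a check. The helper names `latency_ns` and `utilization_percent` are illustrative; the inputs (32 clocks, 333 MHz, 2420 of 6840 DSP48E) are the values given in the slides.

```python
def latency_ns(n_clocks, clock_mhz):
    """Latency in nanoseconds for n_clocks cycles at a clock of clock_mhz MHz."""
    return n_clocks * 1e3 / clock_mhz


def utilization_percent(used, available):
    """Resource utilization as a percentage of what the device offers."""
    return 100.0 * used / available


lat = latency_ns(32, 333.0)             # 32 clocks at 333 MHz
dsp = utilization_percent(2420, 6840)   # DSP48E used vs. available

print(f"latency = {lat:.1f} ns")
print(f"DSP utilization = {dsp:.1f} %")
```

The result is about 96 ns, which the slide rounds to 100 ns, and a DSP utilization just above 35%.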




### **Detect New Physics with Deep Learning**


#### Talk by Z. Wu


### Example AE Model

- Train with simulated ZeroBias event at 200 pileup
- Use simulated Puppi Jet/MET/MHT inputs (18 inputs) with preprocessing
- Activation function: ReLU
- Loss function: L1Loss
- $\ell(x,y) = L = \{l_1, \dots, l_N\}^{\top}, \quad l_n = |x_n - y_n|$
- Training validation ratio: 0.8
- Number of epochs: 100-200
- Number of layers: 8
- Model is designed for simplicity, with firmware implementation and resource/latency requirements in mind
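The L1 loss used for the autoencoder can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the element-wise terms $l_n = |x_n - y_n|$ are averaged (a mean reduction, which the slides do not specify), and the 4-element vectors are made up for illustration, whereas the model above uses 18 inputs.

```python
def l1_loss(x, y):
    """Mean of l_n = |x_n - y_n| over all N elements (mean reduction assumed)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical input vector and its autoencoder reconstruction.
event = [1.0, 0.5, 2.0, 0.0]
reconstruction = [0.9, 0.7, 1.5, 0.2]

score = l1_loss(event, reconstruction)
print(f"L1 loss = {score:.3f}")
```

An event that the autoencoder reconstructs poorly yields a larger loss, which is what makes the reconstruction error a candidate anomaly score.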

#### Traditional Workflow of Searches





The goal is not to claim a discovery, but to give an idea of which exotic signals to integrate into our trigger menus

#### Auto Encoder Workflow of Searches



### Machine Learning-based Trigger for DUNE







Figure: frames (Frame N ... Frame N+2) pass through selection/classification per APA, followed by module-level event selection/classification.

1. Low-level: **CNN-based APA-frame** selection and reweighting
2. Module-level: APA-frame coincidence across module and over 10 seconds
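The module-level step can be sketched as a sliding-window coincidence over the frames selected at the low level. Everything specific here is an assumption for illustration: the function name `module_trigger`, the `min_frames` threshold, and the rule that a fixed number of selected frames inside the 10-second window fires the trigger. The slides only state that coincidence is taken across the module and over 10 seconds.

```python
from collections import deque


def module_trigger(selected_frames, window_s=10.0, min_frames=3):
    """Return the time at which min_frames selected APA frames first fall
    inside a sliding window of window_s seconds, or None if that never happens.

    selected_frames: iterable of (time_s, apa_id) tuples, sorted by time.
    """
    window = deque()
    for t, apa in selected_frames:
        window.append((t, apa))
        # Drop frames that are older than the coincidence window.
        while window and t - window[0][0] > window_s:
            window.popleft()
        if len(window) >= min_frames:
            return t
    return None


# Hypothetical selected frames: (time in seconds, APA index).
frames = [(0.0, 1), (4.0, 7), (15.0, 2), (18.0, 3), (21.0, 7)]
fired_at = module_trigger(frames)
print(fired_at)
```

With these sample frames the first two are too far from the later ones, so the window only fills once the frames at 15, 18 and 21 seconds are all present.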

Performance and power analysis of CNN_s:

| Platform  | Model | Time (s) | Power (W) | Energy Efficiency (img/s/W) |
|-----------|-------|----------|-----------|-----------------------------|
| ARM C-A53 | CNN_s | 0.0855   | 2.871     | 4.074                       |
| FPGA      | CNN_s | 0.0511   | 1.110     | 17.630                      |

\*G. Karagiorgi, Y. Jwa, G. di Guglielmo, L. Carloni; DOI: 10.1109/NYSDS.2019.8909784
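The energy-efficiency column can be reproduced from the time and power columns, assuming the quoted time is the processing time per image. The helper name `energy_efficiency` is illustrative; the numbers are those in the table.

```python
def energy_efficiency(time_s, power_w):
    """Images per second per watt, assuming time_s is the time per image."""
    return 1.0 / (time_s * power_w)


# (time per image in s, power in W) as quoted in the table.
platforms = {
    "ARM C-A53": (0.0855, 2.871),
    "FPGA": (0.0511, 1.110),
}

efficiency = {name: energy_efficiency(t, p) for name, (t, p) in platforms.items()}
ratio = efficiency["FPGA"] / efficiency["ARM C-A53"]

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} img/s/W")
print(f"FPGA / ARM ratio: {ratio:.2f}")
```

Under that assumption the computed values match the table to three decimal places, and the FPGA comes out a little over four times more energy efficient than the ARM core.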


### Accelerated Machine Learning Inference as a Service





**Pros:** scalable algorithms; scalable to the grid/cloud; heterogeneity (mixed hardware)



**Pros:** less system complexity; no network latency

|                        | HCal Reco<br>Network              | Resnet-50 (Top tag)<br>Network              |
|------------------------|-----------------------------------|---------------------------------------------|
| CPU<br>(single-thread) | 67 inf/s                          | 0.6 - 2 img/s<br>(depends on CPU)           |
| GPUaaS w/TensorRT      | <b>333 inf/s</b><br>(batch 16000) | 140 img/s (batch 1)<br>667 img/s (batch 32) |
| FPGA<br>(batch 1)      | <b>500 inf/s</b> (batch 1)        | 660 img/s<br>(Brainwave, aaS)               |

#### Talk by N. Tran


#### co-processor aaS






SONIC: Services for Optimized Network Inference on Coprocessors



## **Overview of Trigger & DAQ Systems**

#### Talk by K. Chen



#### **Energy Frontier**



#### **Intensity Frontier**



### **Triggered Readout - CMS**

#### Talk by C. Herwig

Combine detailed calorimeter & muon information with track trigger at L1 (p<sub>T</sub> > 3-4 GeV, vertices)

- The APx Consortium
  - Pooling of efforts in ATCA Processor hardware, firmware and software development
  - Multiple ATCA processors and mezzanine board types
  - Modular design philosophy, emphasis on platform solutions with flexibility and expandability
  - Reusable circuit, firmware and software elements

- Sophisticated algorithms to combine information from all subdetectors at 40 MHz
- Algorithms with latency of O(100 ns) implemented in FPGAs using ATCA hardware
- Similar strategy pursued by ATLAS





### **Triggered Readout - SBND**

#### Talk by D. Rivera

Trigger decision is critical for LArTPC due to slow drift and high granularity of detectors

Data rates and storage increasingly become an issue



Figure: trigger block diagram with labeled components: Penn Trigger Board (PTB), MTC/A triggers and configuration, CRT triggers, beam gate, WR/timing, V1730 threshold and readout triggers, Nevis TB, V2495 digital I/O.

Hardware trigger implemented to decide whether or not the TPC should be read out, based on a combination of information from several key sources

### **Real-Time Reconstruction - LHCb**

#### Talk by D. Craik

Several interesting physics signals are high rate processes at LHCb

- Improved sensitivity by accessing event information early on
- LHCb performs analysis in real time
- Data is buffered before the final stage of the trigger to derive calibrations & alignment
- Reconstruction is performed at bunch-crossing rate with the same quality as offline for most objects
- Full raw event is no longer stored, reducing the load on offline reconstruction

#### Already successfully used for several results & plans for extension for next run




### **Real-Time Analysis - CMS**

#### Talk by R. Mommsen

CMS is planning a 40 MHz real-time analysis stream for HL-LHC, interesting for physics and as a diagnostic & monitoring tool



- Gained experience in Run 2, plans for expansion in Run 3
- Successful implementation requires R&D activities on several fronts
  - HW inference engines
  - Stream processing
  - Distributed algorithms (MPI)
  - NVRAM latency
  - Searchable Feature DB
  - Key-value store to assemble and buffer event fragments
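The key-value store idea in the last item can be sketched with an in-memory dictionary keyed by event ID. This is a toy illustration, not the CMS design: the class name `EventBuilder`, the source names, and the rule that an event is complete once every expected source has delivered a fragment are all assumptions.

```python
class EventBuilder:
    """Buffer event fragments by event ID until all sources have reported."""

    def __init__(self, expected_sources):
        self.expected = frozenset(expected_sources)
        self.store = {}  # event_id -> {source: payload}

    def add_fragment(self, event_id, source, payload):
        """Buffer one fragment; return the full event once it is complete."""
        if source not in self.expected:
            raise KeyError(f"unknown source: {source}")
        fragments = self.store.setdefault(event_id, {})
        fragments[source] = payload
        if set(fragments) == self.expected:
            # All fragments arrived: hand the event over and free the buffer.
            return self.store.pop(event_id)
        return None


# Hypothetical sources and payloads.
builder = EventBuilder(["tracker", "calo", "muon"])
first = builder.add_fragment(42, "tracker", b"\x01")
second = builder.add_fragment(42, "calo", b"\x02")
third = builder.add_fragment(42, "muon", b"\x03")
print(first, second, third)
```

Fragments may arrive in any order and interleaved across events; only the final fragment of an event returns the assembled dictionary.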


### Continuous Readout - MicroBooNE

MicroBooNE's Continuous Readout Stream seeks to observe a supernova signal

- Reads out data continuously and stores it until external trigger is issued

Supernova Early Warning System (SNEWS)



#### Talk by I. Ponce

Figure: data is saved continuously for several hours; if a SNEWS alert arrives, the data is saved for 2 days.
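The continuous readout described above behaves like a rolling buffer that is only preserved when an external alert arrives. A minimal sketch, assuming a time-based retention window: the class name `ContinuousBuffer`, the 3-hour retention, and the hourly frames are illustrative choices, not MicroBooNE parameters.

```python
from collections import deque


class ContinuousBuffer:
    """Keep the most recent retention_s seconds of data until an alert arrives."""

    def __init__(self, retention_s):
        self.retention_s = retention_s
        self.frames = deque()  # (time_s, payload), oldest first

    def push(self, time_s, payload):
        """Store a frame and drop frames older than the retention window."""
        self.frames.append((time_s, payload))
        while self.frames and time_s - self.frames[0][0] > self.retention_s:
            self.frames.popleft()

    def on_alert(self):
        """Return everything currently buffered so it can be saved long term."""
        return list(self.frames)


# Hypothetical setup: 3-hour retention, one frame per hour for 6 hours.
buffer = ContinuousBuffer(retention_s=3 * 3600)
for hour in range(6):
    buffer.push(hour * 3600, f"frame-{hour}")

saved = buffer.on_alert()
saved_times = [t for t, _ in saved]
print(saved_times)
```

Frames older than the retention window are discarded as new data arrives, so an alert only preserves the most recent stretch of the stream.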

## **Common Challenges and R&D**

**Experiments with large** 

#### Talk by K. Chen



Figure legend: Ethernet speed; speed in development; possible future speed.

# Exciting developments in Trigger DAQ and ML!


# Thank You!