

# Machine Learning Techniques to Estimate the Functional Failure Rate of Complex Circuits

<u>Thomas Lange,</u> Aneesh Balakrishnan, Maximilien Glorieux, Dan Alexandrescu and Luca Sterpone

www.rescue-etn.eu

BOSCH















### **Motivation**

- Due to
  - technology scaling,
  - lower supply voltages,
  - higher operating frequencies
  - ⇒ circuits become more vulnerable to transient faults



ELICSIR 2021, 17 - 18 November 2021, T. Lange et al.

Source: http://cccp.eecs.umich.edu/research/reliability.php

### **Motivation**

- Circuits become more vulnerable to transient faults




Source: https://www.explainthatstuff.com/integratedcircuits.html

### **Motivation**

- Circuits become more vulnerable to transient faults
- Complexity of today's circuits is increasing
- Requirements of Functional Safety Standards
  - → failure analysis needs to be performed at the application level

#### Table 1: ISO 26262 Target Values for Quantitative Evaluation Metrics

|                           | ASIL-B    | ASIL-C    | ASIL-D   |
|---------------------------|-----------|-----------|----------|
| Random HW Faults          | ≤ 100 FIT | ≤ 100 FIT | ≤ 10 FIT |
| Single Point Fault Metric | ≥ 90%     | ≥ 97%     | ≥ 99%    |
| Latent Fault Metric       | ≥ 60%     | ≥ 80%     | ≥ 90%    |



# Background

- Transient faults are caused by
  - radiation, noise, power disturbance, etc.
- Not all faults lead to errors or failures
- Fault → Error → Failure
  - Masking Mechanism
    - Electrical Masking
    - Temporal Masking
    - Logical Masking
    - Functional Masking





## **Background – Masking Mechanism**



Source: E. Costenaro – "Techniques for the evaluation and the improvement of emergent technologies' behavior facing random errors," PhD Thesis, Université Grenoble Alpes, 2015.


# **Background - Masking**

- Electrical Masking
- Temporal Masking
- Logical Masking
- Functional Masking



[Figure: waveforms of clk and Z/D. Panels: (b) SET masked; (c) SET sampled (not masked).]






# Background



- Fault → Error → Failure
  - De-Rating/Vulnerability Factor
    - Electrical De-Rating (EDR)
    - Temporal De-Rating (TDR)
    - Logical De-Rating (LDR)
    - Functional De-Rating (FDR)



$$FIT_{\rm Eff} = \sum_{\substack{\rm circuit \\ \rm elements}} FIT_{\rm Tech} \times EDR \times TDR \times LDR \times FDR \times \left\{ f_{\rm Tech} \right\}$$

Failure classes: failure class 1, failure class 2, ..., failure class *n*
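As a worked illustration of the sum above, here is a minimal Python sketch. The element names and all numeric values are made up for illustration, and only the $FIT_{\rm Tech}$ term and the four de-rating factors are modeled; the split into failure classes is left out.

```python
from dataclasses import dataclass


@dataclass
class CircuitElement:
    """One circuit element with hypothetical, illustrative values."""
    name: str
    fit_tech: float  # raw technology failure rate in FIT
    edr: float       # Electrical De-Rating, 0..1
    tdr: float       # Temporal De-Rating, 0..1
    ldr: float       # Logical De-Rating, 0..1
    fdr: float       # Functional De-Rating, 0..1


def effective_fit(elements):
    """Sum the de-rated FIT contribution of every circuit element."""
    return sum(e.fit_tech * e.edr * e.tdr * e.ldr * e.fdr for e in elements)


# Two invented flip-flops; ff_b is fully masked at the functional level.
elements = [
    CircuitElement("ff_a", fit_tech=100.0, edr=1.0, tdr=0.5, ldr=0.5, fdr=0.2),
    CircuitElement("ff_b", fit_tech=100.0, edr=1.0, tdr=0.5, ldr=1.0, fdr=0.0),
]
total = effective_fit(elements)
print(f"Effective FIT: {total:.2f}")
```

In this toy case `ff_b` contributes nothing because its FDR is 0, which is how functional masking lowers the effective failure rate below the raw technology rate.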


## Background

- Fault → Error
  - fault simulation
  - structural design exploration
  - propagation analysis
- Error → Functional Failure
  - accelerated testing
  - simulation-based approaches
  - significant costs
    - human effort
    - processing resources
    - tool licenses


### **Basic Idea**

- Use Machine Learning for the reliability analysis
  - → reduce the cost
- What are we trying to predict?
  - functional reliability metrics for Flip-Flops (FF)
- How do we do it?
  - gather as much information from the circuit as we can (features)
    - collection needs to be economical
  - obtain Functional De-Rating (FDR) as training/reference data
  - use Machine Learning techniques to train models



# **Initial Methodology**

- Extract features
  - from Gate-Level Netlist and Testbench/Simulation
- Gather FDR Reference/Training data
  - by fault injection simulation for (parts of) the circuit
- Train a model
  - supervised regression
  - training size: number of training samples
- Predict FDR factors
  - per individual flip-flop instance
- Benchmark/Validate model
  - against reference data
  - cross-validation is used

→ Obtain a model that is trained for one circuit


### **Feature Extraction**

Feature names, by group ("Number of" replaces the "#" shorthand):

#### **Structural Related Features**

- Number of FFs at Startpoint/Endpoint
- Number of Connections from/to FF
- Number of Connections from/to Primary Input/Output
- Number of FF Stages to/from Primary Input/Output (max/avg/min)
- Number of Constant Drivers
- Has Feedback
- Feedback Depth
- Is Part of Bus
- Bus Position
- Bus Length
- Bus Label
- Module Label

#### **Signal Activity Related Features**

- @0/@1
- State Changes

#### **Synthesis Related Features**

- Drive Strength
- Depth Combinatorial Path
- Number of Combinatorial Cells at/from Input/Output

- The feature set
  - characterizes each FF instance of the circuit
  - contains attributes from
    - static elements
    - dynamic elements
  - is extracted from the Gate-Level Netlist (GLN) and Simulation/Testbench





- Target Flip-Flop *FF<sub>i</sub>*






- Feature: FF Fan-In = 3








- Feature: FF Fan-Out = 4






- Feature: Nr of Connections from Primary Inputs = 2
- Feature: Proximity from Primary Input (FF Stages) = 1





- Feature: Nr of Connections to Primary Outputs = 3
- Feature: Proximity to Primary Outputs (FF Stages) = 1





- Feature: Feedback Loop = true
- Feature: Feedback Loop Depth (FF Stages) = 1
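The structural features illustrated on the preceding slides can be read off a connectivity graph. The following is a minimal sketch of that idea, not the extraction tool used in this work: the toy graph, the node naming scheme (`pi*`, `po*`, `ff*`) and all helper names are assumptions made for illustration, and the values it produces are unrelated to the figures.

```python
from collections import deque

# Hypothetical FF-level connectivity graph: an edge u -> v means that u
# drives v through combinational logic only. "pi*" are primary inputs,
# "po*" primary outputs, "ff*" flip-flops.
GRAPH = {
    "pi0": ["ff_i"],
    "pi1": ["ff_i", "ff1"],
    "ff1": ["ff_i"],
    "ff_i": ["ff_i", "ff2", "po0"],
    "ff2": ["po1"],
    "po0": [],
    "po1": [],
}


def fan_out(graph, ff):
    """Number of flip-flops directly driven by `ff` (a self-loop counts)."""
    return sum(1 for dst in graph[ff] if dst.startswith("ff"))


def fan_in(graph, ff):
    """Number of flip-flops directly driving `ff` (a self-loop counts)."""
    return sum(1 for src, dsts in graph.items()
               if src.startswith("ff") and ff in dsts)


def primary_input_connections(graph, ff):
    """Number of primary inputs that reach `ff` without passing another FF."""
    return sum(1 for src, dsts in graph.items()
               if src.startswith("pi") and ff in dsts)


def feedback_depth(graph, ff):
    """Length of the shortest loop from `ff` back to itself, or None."""
    queue = deque((dst, 1) for dst in graph[ff])
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == ff:
            return depth
        if node in seen:
            continue
        seen.add(node)
        queue.extend((dst, depth + 1) for dst in graph[node])
    return None


features = {
    "fan_in": fan_in(GRAPH, "ff_i"),
    "fan_out": fan_out(GRAPH, "ff_i"),
    "pi_connections": primary_input_connections(GRAPH, "ff_i"),
    "feedback_depth": feedback_depth(GRAPH, "ff_i"),
}
print(features)
```

A breadth-first search is used for the feedback depth so that the first time the target flip-flop is reached again, the path is guaranteed to be the shortest loop.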








- Feature: Cell Properties – Drive Strength = 2






- Feature: Signal Activity
  - Transitions @Q = 6
  - At 0/At 1 @Q = 5/3
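A minimal sketch of how such signal activity features could be computed from a simulation trace. The trace below is hypothetical and was chosen only so that it reproduces the numbers above; counting in sampled clock cycles is likewise an assumption.

```python
def signal_activity(samples):
    """Return the transition count and the cycles spent at 0 and at 1."""
    transitions = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    at_one = sum(samples)
    at_zero = len(samples) - at_one
    return {"transitions": transitions, "at_0": at_zero, "at_1": at_one}


# Hypothetical per-cycle samples of a flip-flop output Q (0 or 1).
q_trace = [0, 1, 0, 0, 1, 0, 1, 0]
activity = signal_activity(q_trace)
print(activity)  # {'transitions': 6, 'at_0': 5, 'at_1': 3}
```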




## **Model Implementation**

- Models are implemented using Python's scikit-learn framework
  - No licenses needed



- Several regression models have been evaluated
  - Different performance/error metrics have been applied
    - Coefficient of Determination: $R^2 \in (-\infty, 1]$
- Cross-validation with 10 folds and a training size of 50%
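The evaluation setup described above could look roughly as follows in scikit-learn. This is a sketch under stated assumptions, not the original evaluation script: the data is synthetic, `ShuffleSplit` with 10 splits is one possible reading of "10 folds with a training size of 50%", and Ridge and k-NN are arbitrary stand-ins for the regression models that were compared.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-flip-flop feature matrix and FDR values;
# real FDR values would come from fault injection simulation.
X = rng.random((300, 6))
y = np.clip(0.5 * X[:, 0] + 0.5 * X[:, 1] * X[:, 2], 0.0, 1.0)

# 10 random splits, each training on 50% of the flip-flops and
# testing on the remaining 50%.
cv = ShuffleSplit(n_splits=10, train_size=0.5, random_state=0)

models = {
    "Ridge": make_pipeline(StandardScaler(), Ridge()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVR (RBF)": make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.01)),
}

results = {}
for name, model in models.items():
    # One R^2 score per split; higher is better, 1.0 is a perfect fit.
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

Wrapping each model in a pipeline with `StandardScaler` keeps the scaling inside every split, so no statistics from the test half leak into training.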



### **Prediction Results – Functional Failure**

- Prediction Results (training size = 50%)




# **Prediction Results (Example)**

- Prediction Results (training size = 50%)
  - $R^2 = 0.927$
- Model is trained and predicts FDR factors only for one circuit!



[Figure panel: (e) SVR w/ RBF Kernel]

### **Towards Training of a Universal Model**




## **Considered Circuits**

- Benchmark circuits
  - ISCAS'85/89
  - ITC'99
  - IWLS'2005
  - OpenCores designs
    - 10GE MAC
    - Double Precision Floating Point
    - Secure Hash Algorithm 3 (SHA-3)
    - Advanced Encryption Standard (AES)
    - USB 2.0 Functional
    - etc.
- RISC-V processors (picorv32, lowRISC ibex, rocket chip)



- Gathering the training and reference data is expensive
  - exhaustive fault injection simulation campaigns need to be performed
- Develop an open-source fault injection flow
  - based on open-source simulators (Icarus Verilog, Verilator, ...)
  - better scalability of the simulation campaigns
  - make the flow openly accessible (e.g. GitHub)
  - community can adapt it and contribute to the collection of the data
- → Obtain a large, open fault injection database
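To illustrate what one entry of such a database might reduce to, here is a minimal sketch that turns raw campaign records into per-flip-flop FDR values. It assumes that FDR is estimated as the fraction of injected faults that end in a functional failure; the record format and instance names are hypothetical and do not describe the planned flow.

```python
from collections import Counter

# Hypothetical campaign log: one (flip-flop instance, caused a functional
# failure?) record per injected fault.
campaign = [
    ("ff_a", True), ("ff_a", False), ("ff_a", False), ("ff_a", True),
    ("ff_b", False), ("ff_b", False),
    ("ff_c", True), ("ff_c", True), ("ff_c", True), ("ff_c", False),
]


def fdr_per_flip_flop(records):
    """Estimate FDR as failures / injections for each flip-flop."""
    injected = Counter()
    failed = Counter()
    for ff, is_failure in records:
        injected[ff] += 1
        failed[ff] += int(is_failure)
    return {ff: failed[ff] / injected[ff] for ff in injected}


fdr = fdr_per_flip_flop(campaign)
print(fdr)  # {'ff_a': 0.5, 'ff_b': 0.0, 'ff_c': 0.75}
```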



# **Conclusions and Perspective**

- Machine Learning can be used to predict reliability metrics
  - Model is trained and predicts FDR factors for one circuit
- Create a Machine Learning based Reliability Analysis tool
  - train the tool on a variety of circuits, workloads, applications
  - able to predict reliability metrics in seconds on very large circuits
- Future work
  - Improve the feature set/add new features to increase performance





