



## Methodology for error rate prediction of applications implemented in Multi-core processors

Pablo Ramos

**PhD Student** 


**Abstract**: The design of multi-core and many-core processors is becoming a new challenge, since manufacturers have to balance critical factors such as performance, reliability and power consumption. The exceptional computational capabilities of these devices make them very attractive for implementing high-performance applications in scientific and commercial fields. However, continuous technology shrinking, combined with design complexity, increases their vulnerability to natural radiation, especially to Single Event Effects (SEEs). A significant advantage of many-core processors in facing this vulnerability is their inherent redundancy, which makes them ideal for implementing fault-tolerant techniques. In addition, complementary protection mechanisms such as Error Correcting Codes (ECC) and parity are commonly implemented in memory cells to improve device reliability. Nevertheless, these additional protections require extra area, which leads to higher power consumption and performance degradation. The dependability of multi-core and many-core processors is a crucial issue, especially if the devices are intended for safety-critical applications. It is thus mandatory to evaluate the SEE sensitivity of such devices. This work evaluates the sensitivity of the multi-core PowerPC P2041RDB and the many-core KALRAY MPPA-256.

- INTRODUCTION
- SOFT ERROR RATE ESTIMATION
  - REAL-LIFE TESTS
  - FAULT INJECTION CAMPAIGNS
    - Fault injection on multi-core P2041 processor
      - Targeting program variables
      - Targeting processor registers
    - Fault injection on MPPA many-core processor
  - RADIATION GROUND TESTING
    - Radiation campaigns in P2041 multi-core processor
    - Radiation campaigns in MPPA-256 many-core processor
- CONCLUSIONS AND FUTURE WORK


## INTRODUCTION

Electronic circuits are sensitive to natural radiation, including high-energy protons and other energetic particles (electrons, neutrons and ions), coming mainly from the solar wind, cosmic rays and the Van Allen radiation belts.

Credit: Asimetrie/Infn




• Natural radiation can alter the characteristics of integrated circuits, producing undesirable effects that range from temporary to permanent failures. In microelectronics, these effects are called SEEs (Single Event Effects).



A representative form of SEE is the SEU (Single Event Upset), in which the deposited energy causes a single bit of a memory cell to flip its logical state, with unexpected consequences at the application level.












## FAULT INJECTION

- Hardware implemented fault injection (HWIFI)
- Software implemented fault injection (SWIFI)

## RADIATION GROUND TESTING




## **REAL-LIFE TESTS**

### SRAM-CHECKER

An SRAM-based board built on 65 nm technology will be placed in high-altitude environments at different latitudes to detect SEEs under real conditions.

### PILOT BOARD

- 64 SRAM memory chips on 65nm tech.
- Chip capacity: 16Mbit
- Total capacity: 1 Gbit
- Arduino Uno module for control
- 3G module for communications
- GPS capability




## FAULT INJECTION ON MULTI-CORE PROCESSOR

Applying the CEU (Code Emulating Upset) approach to multi-core processors



Fig. I: Flow chart of the fault injection approach on a standard matrix multiplication
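The injection flow on a matrix multiplication can be illustrated with a small software model. The sketch below is not the CEU tool itself: it assumes 32-bit matrix elements, models the injection instant as "just before output row `instant` is computed", and only distinguishes silent faults from result errors (time-outs and exceptions are not modelled). All function names are hypothetical.

```python
import random

WORD_BITS = 32  # assumed width of one matrix element


def flip_bit(value, bit):
    """Return value with one bit inverted, kept as a 32-bit unsigned word."""
    return (value ^ (1 << bit)) & ((1 << WORD_BITS) - 1)


def golden_product(a, b):
    """Fault-free reference result of the square matrix product a x b."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def run_with_fault(a, b, target, row, col, bit, instant):
    """Multiply a by b, flipping one bit of matrix `target` ("A", "B" or "C")
    just before output row `instant` is computed (instant == n: after the run)."""
    n = len(a)
    mats = {"A": [r[:] for r in a], "B": [r[:] for r in b],
            "C": [[0] * n for _ in range(n)]}
    for step in range(n + 1):
        if step == instant:
            mats[target][row][col] = flip_bit(mats[target][row][col], bit)
        if step < n:
            # Writing the output row overwrites any fault injected there earlier.
            mats["C"][step] = [sum(mats["A"][step][k] * mats["B"][k][c]
                                   for k in range(n)) for c in range(n)]
    return mats["C"]


def classify(a, b, **fault):
    """Compare the faulty run with the golden run."""
    faulty = run_with_fault(a, b, **fault)
    return "silent fault" if faulty == golden_product(a, b) else "result error"


def campaign(a, b, runs, seed=0):
    """Inject one random single-bit fault per run and tally the outcomes."""
    rng = random.Random(seed)
    n = len(a)
    counts = {"silent fault": 0, "result error": 0}
    for _ in range(runs):
        outcome = classify(a, b, target=rng.choice("ABC"),
                           row=rng.randrange(n), col=rng.randrange(n),
                           bit=rng.randrange(WORD_BITS),
                           instant=rng.randrange(n + 1))
        counts[outcome] += 1
    return counts
```

In this model a fault is silent when it hits data that has already been consumed (an input row used earlier) or that will be overwritten later (an output element not yet computed), which is why the injection instant matters as much as the target address.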


## Architecture Freescale P2041



Fig. II: QorIQ P2041 memory architecture

| Sensitive zone | Location         | Capacity                         | Description              |
|----------------|------------------|----------------------------------|--------------------------|
| L1             | Cores 0, 1, 2, 3 | 32 KB / D and 32 KB / I per core | Data / Instruction Cache |
| L2             | Cores 0, 1, 2, 3 | 128 KB per core                  | Backside Unified Cache   |
| L3             | Multi-core       | 1024 KB per chip                 | Frontside cache          |
| GPR            | Cores 0, 1, 2, 3 | 32 registers of 32 bits          | General purpose register |
| FPR            | Cores 0, 1, 2, 3 | 32 registers of 64 bits          | Floating point register  |

Table I. Sensitive areas of the P2041 multi-core processor


## FAULT INJECTION ON P2041 MULTI-CORE PROCESSOR

Fault injection in program variables of the matrix multiplication (MM), caches disabled



Fig. III: Fault injection consequences in the application implemented on the P2041

## FAULT INJECTION ON P2041 MULTI-CORE PROCESSOR

Fault injection in program variables of the matrix multiplication (MM), caches enabled

Table II: Results of fault injection in program variables of the MM, caches enabled

| SEUs per run | Runs  | Silent faults | Result errors | Time-outs | Exceptions | SER (%) |
|--------------|-------|---------------|---------------|-----------|------------|---------|
| 1            | 99069 | 34410         | 64657         | 1         | 1          | 65.27   |
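The SER column can be reproduced from the outcome counts. The following is a minimal sketch, assuming that every outcome other than a silent fault (result errors, time-outs and exceptions) counts as an error; the dictionary keys and the function name are illustrative only.

```python
def soft_error_rate(outcomes):
    """Fraction of injected faults that produced any non-silent outcome."""
    total = sum(outcomes.values())
    errors = total - outcomes["silent_faults"]
    return errors / total


# Outcome counts for one SEU per run on the P2041, caches enabled.
p2041_outcomes = {
    "silent_faults": 34410,
    "result_errors": 64657,
    "time_outs": 1,
    "exceptions": 1,
}

rate = soft_error_rate(p2041_outcomes)
print(f"runs = {sum(p2041_outcomes.values())}, SER = {100 * rate:.2f} %")
```

The four outcome counts add up to the 99069 runs of the campaign, so each run is classified exactly once.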



Fig. IV: Number of errors vs execution time

Fig. V: Number of errors vs matrices addresses


## SOFT ERROR RATE ESTIMATION IN ELECTRONIC DEVICES

## Fault injection in processor registers



Fig. VI: Consequences of fault injection in processor registers


## Architecture KALRAY MPPA-256



Fig. VII: KALRAY MPPA-256 memory architecture

| Sensitive zone | Location          | Capacity                         | Description               |
|----------------|-------------------|----------------------------------|---------------------------|
| SMEM           | Computing Cluster | 2 MB per cluster                 | Static Shared Memory      |
| IC             | VLIW core         | 8 KB per core                    | Instruction Cache         |
| DC-CC          | CC VLIW core      | 8 KB per core                    | Separated Data cache      |
| DC-IO          | I/O cluster       | 128 KB per I/O                   | Shared Data cache         |
| GPR            | VLIW Core         | 64 registers of 32 bits per core | General Purpose Registers |
| SFR            | VLIW Core         | 64 registers of 32 bits per core | System Function Registers |

Table III. Sensitive areas of the MPPA-256 many-core processor

## FAULT INJECTION ON MPPA MANY-CORE PROCESSOR

## Fault injection in processor registers GPRs and SFRs

Table IV. Results of fault injection in the GPRs and SFRs of the MPPA-256 many-core processor

| Zone  | Silent Faults | Result errors | Time-outs | Hangs |
|-------|---------------|---------------|-----------|-------|
| GPRs  | 36472         | 16387         | 6678      | 1996  |
| SFRs  | 2200          | 1696          | 2448      | 1432  |
| Total | 38672         | 18083         | 9126      | 3428  |

From these results, the error rate of the registers can be calculated with the following equation, counting result errors, time-outs and hangs as errors.

$$SER = \frac{\text{Number of errors}}{\text{Faults Injected}} = \frac{30637}{69309} = 44.20 \times 10^{-2}$$
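The same counts also allow a per-zone breakdown, which the overall figure hides. The sketch below recomputes the global rate and the separate GPR and SFR rates from the fault injection results; the per-zone split is an additional illustration under the same error definition, not a figure reported in the slides, and the names are hypothetical.

```python
# (silent faults, result errors, time-outs, hangs) per register zone
mppa_injection = {
    "GPRs": (36472, 16387, 6678, 1996),
    "SFRs": (2200, 1696, 2448, 1432),
}


def zone_error_rate(counts):
    """Errors (everything except silent faults) over faults injected."""
    silent, *errors = counts
    return sum(errors) / (silent + sum(errors))


def overall_error_rate(table):
    """Error rate over all zones taken together."""
    injected = sum(sum(counts) for counts in table.values())
    errors = sum(sum(counts[1:]) for counts in table.values())
    return errors / injected


for zone, counts in mppa_injection.items():
    print(f"{zone}: {zone_error_rate(counts):.4f}")
print(f"Total: {overall_error_rate(mppa_injection):.4f}")
```

Under these counts the SFRs show a markedly higher error rate per injected fault than the GPRs, although they received far fewer injections.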


## RADIATION GROUND TESTING

## RADIATION FACILITY

Accelerated radiation experiments took place at the GENEPI2 facility, located at the Laboratoire de Physique Subatomique et de Cosmologie (LPSC) in Grenoble, France.

Energy: 3 MeV or 14 MeV neutron beam.





Fig. VIII: Experiments at GENEPI2



## RADIATION GROUND TESTING

## RADIATION FACILITY VALIDATION

## IMPORTANT ISSUE

#### VALIDATE THE FACILITY FOR ELECTRONICS

### Device tested

CMOS SRAM 90nm memory from CYPRESS, 16 Mbit capacity (CY62167EV30LL).

## Test Parameters

- Energy: 15 MeV neutrons
- Neutron flux: 3×10<sup>4</sup> n·cm<sup>-2</sup>·s<sup>-1</sup>
- Distance from target: 40 cm
- Exposure duration: 1 hour
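As a worked example, the fluence received by the device follows from the flux and the exposure duration, assuming the flux stayed constant over the whole hour (the symbols $\phi$ for flux and $\Phi$ for fluence are introduced here for illustration):

$$\Phi = \phi \cdot t = 3 \times 10^{4} \ \frac{n}{cm^{2} \cdot s} \times 3600 \ s = 1.08 \times 10^{8} \ \frac{n}{cm^{2}}$$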



## RADIATION GROUND TESTING

## RADIATION GROUND TEST RESULTS

**Events observed**

Data pattern: 0x5555 (0b0101010101010101)

- Single Event Upset (SEU)
- Multiple Bit Upset (MBU)
- Multiple Cell Upset (MCU)

Table V. Single event upset types

| Chip | Address | Data                                        |
|------|---------|---------------------------------------------|
| 0x3B | 0xEE6CC | 0x**D**555 (0b**1**101010101010101)         |
| 0x3B | 0x657FA | 0x**D**557 (0b**1**10101010101011**1**)     |
| 0x3B | 0x657F6 | 0x5557 (0b0101010101010111)                 |
| 0x3B | 0x657FA | 0x**D**557 (0b**1**101010101010111)         |
| 0x3B | 0x657BE | 0xF557 (0b1111010101010111)                 |
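The flipped bits in each read-back word can be located by XOR-ing it with the written pattern. The sketch below assumes 16-bit words and the usual convention that one flipped bit in a word is an SEU and several flipped bits in the same word are an MBU; grouping upsets across neighbouring addresses into MCUs would additionally need the physical layout of the memory, which is not modelled here. Function names are hypothetical.

```python
PATTERN = 0x5555  # data pattern written to every 16-bit word before exposure
WORD_BITS = 16


def flipped_bits(word, pattern=PATTERN):
    """Bit positions (0 = least significant) that differ from the pattern."""
    diff = word ^ pattern
    return [bit for bit in range(WORD_BITS) if (diff >> bit) & 1]


def classify_word(word, pattern=PATTERN):
    """Label a read-back word by the number of bits that flipped in it."""
    count = len(flipped_bits(word, pattern))
    if count == 0:
        return "no upset"
    return "SEU" if count == 1 else "MBU"


for word in (0xD555, 0xD557, 0x5557, 0xF557):
    print(f"{word:#06x}: bits {flipped_bits(word)} -> {classify_word(word)}")
```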



## RADIATION GROUND TESTING

## RADIATION GROUND TEST RESULTS

Cross-section ($\sigma$): a quantity that expresses the sensitivity of a component exposed to ionizing radiation, given in cm<sup>2</sup>/bit or cm<sup>2</sup>/component.



Fig. IX: Neutron and proton cross-section of SRAM memories


## Static Radiation Test P2041

#### Table VI. Static test results

| SEE Type | Type of error              | Occurrences | Consequences |
|----------|----------------------------|-------------|--------------|
| SEU      | L1 Instruction parity      | 0           | Hang         |
| SEU      | L1 Data cache parity       | 9           | None         |
| SEU      | L2 Single-bit ECC          | 29          | None         |
| SEFI     | L2 Tag parity              | 5           | Hang         |
| SEU      | L2 Multiple-bit Tag Parity | 1           | None         |
| SEU      | L3 Single-bit ECC          | 7           | None         |
| SEFI     | L3 Multiple-bit ECC        | 6           | Hang         |
| SEFI     | Other errors               | 1           | Hang         |
| Total    |                            | 58          |              |

#### Eq. 1: Static Cross-section

$$\sigma_{\text{STATIC}} = \frac{\text{Number of upsets}}{\text{Fluence}} = \frac{58}{1.41 \times 10^{9}} = 4.11 \times 10^{-8} \frac{\text{cm}^2}{\text{dev}}$$

For a 95% confidence interval:

$$3.12 \times 10^{-8} \frac{\text{cm}^2}{\text{dev}} < \sigma_{\text{STATIC}} < 5.32 \times 10^{-8} \frac{\text{cm}^2}{\text{dev}}$$
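The slides do not state how the interval was obtained. One common choice for counting experiments is the exact (Garwood) Poisson interval on the number of upsets, and the sketch below, which uses only the standard library and hypothetical function names, gives limits consistent with the bounds quoted above for 58 upsets and a fluence of 1.41×10<sup>9</sup> n/cm<sup>2</sup>.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), with terms evaluated in log space."""
    if k < 0:
        return 0.0
    return sum(math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
               for i in range(k + 1))


def _root_of_decreasing(f, low, high, iterations=200):
    """Bisection for a function that decreases from positive to negative."""
    for _ in range(iterations):
        mid = (low + high) / 2
        if f(mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def poisson_limits(events, confidence=0.95):
    """Exact two-sided confidence limits on the mean of a Poisson count."""
    tail = (1 - confidence) / 2
    top = events + 10 * math.sqrt(events + 1) + 10  # safely above the upper limit
    if events == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= events | mean) equals the tail probability.
        lower = _root_of_decreasing(
            lambda m: tail - (1 - poisson_cdf(events - 1, m)), 1e-12, top)
    # Upper limit: P(X <= events | mean) equals the tail probability.
    upper = _root_of_decreasing(
        lambda m: poisson_cdf(events, m) - tail, 1e-12, top)
    return lower, upper


def cross_section_interval(events, fluence, confidence=0.95):
    """Confidence interval of the cross-section events / fluence."""
    lower, upper = poisson_limits(events, confidence)
    return lower / fluence, upper / fluence


low, high = cross_section_interval(58, 1.41e9)
print(f"{low:.2e} < sigma_static < {high:.2e}  (cm^2/device)")
```

The interval is asymmetric around 4.11×10<sup>-8</sup> because the Poisson distribution is skewed; the asymmetry becomes much more pronounced for the small event counts typical of dynamic tests.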

## **Dynamic Radiation test P2041**

| SEE Type | Type of error      | Test 1 | Test 2 | Consequences      |
|----------|--------------------|--------|--------|-------------------|
| SEFI     | Load Instruction   | 1      | 0      | Hang              |
| SEU      | L1 Data parity     | 19     | 17     | None              |
| SEU      | L2 Single-bit ECC  | 9      | 20     | None              |
| SEFI     | L2 Tag parity      | 0      | 4      | Hang              |
| SEU      |                    | 3      | 1      | None              |
| SEU      | Multiple L2 errors | 3      | 1      | None              |
| SEU      | L3 Single-bit ECC  | 3      | 2      | None              |
| SEFI     | Instruction fetch  | 0      | 1      | Hang              |
| MBU      | Other errors       | 6      | 0      | App. result error |
| Total    |                    | 44     | 46     |                   |

#### Table VII. Dynamic test results

Application result errors: Three clusters of errors occurred in Core 2, and one in Core 1. All of them were very closely related and they were detected in the same read cycle. Each cluster involves exactly 16 consecutive positions of the resulting matrix. Each matrix element was an integer value (4 bytes). In all cases, an incorrect result of "2" was observed instead of the expected "160".

## **Dynamic Radiation test P2041**



Fig. X: QorIQ P2041 memory architecture

Taking into account the data address mapping shown in Figure 4.11 (a):

Any line tag in the interval (0x403D6 - 0x403DC) (matrix B) could have become the line tag of an error cluster. Comparing the tags of the error clusters with each of the tags in this interval made it possible to detect an MBU affecting bits b1 and b2, which are physically adjacent. In the three cases the tags had to be changed (from 0x403DB to 0x403DD and from 0x403D8 to 0x403DE). These errors were not detected by the parity protection mechanism, since flipping two bits leaves the parity bit unchanged. Note that the L1 cache implements only one parity bit per tag. Thus, in the authors' opinion, a particle modified two consecutive bits (MBU) belonging to three different tags (Multiple Cell Upset with multiplicity of three). Moreover, decoding the corrupted addresses showed that the cache lines in Sets 0x1A, 0x1E and 0x20 were affected.
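The tag values quoted above can be checked directly. The short sketch below assumes a single parity bit computed over the whole tag (the exact parity scheme of the L1 cache is not detailed in the slides) and shows that both corruptions flip exactly bits b1 and b2 while leaving that parity unchanged; the function names are illustrative.

```python
def parity(word):
    """Parity of a word: 1 if the number of set bits is odd, else 0."""
    return bin(word).count("1") % 2


def flipped_positions(original, corrupted):
    """Bit positions (0 = least significant) that differ between two words."""
    diff = original ^ corrupted
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# (original tag, corrupted tag) pairs reported for the error clusters
tag_pairs = [(0x403DB, 0x403DD), (0x403D8, 0x403DE)]

for original, corrupted in tag_pairs:
    bits = flipped_positions(original, corrupted)
    same = parity(original) == parity(corrupted)
    print(f"{original:#x} -> {corrupted:#x}: bits {bits}, parity unchanged: {same}")
```

Any even number of flipped bits escapes a single parity bit, which is why a double-bit upset in a tag can silently redirect a cache line.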

# Error-rate Prediction P2041

• Dynamic cross-section from application errors

$$0.22 \times 10^{-8} \frac{cm^2}{dev} < \sigma_{DYNAMIC} < 0.73 \times 10^{-8} \frac{cm^2}{dev}$$

• Application error-rate from fault injection and static cross-section

$$\tau_{SEU} = \tau_{inj} \times \sigma_{STATIC}$$

$$\tau_{SEU} = 0.65 \times 4.11 \times 10^{-8} = 2.67 \times 10^{-8} \frac{cm^2}{dev}$$

Comparing the predicted value with the calculated confidence interval shows a considerable overestimation. Therefore, the CEU approach does not provide a good estimation of the error rate for this device, since it implements ECC and parity in its cache memories, and most of the detected errors are corrected either by the ECC or by cache invalidation.
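The comparison can be made explicit with a few lines of code. This is a minimal sketch using the values quoted on this slide; the function names and the overestimation factor relative to the upper bound are illustrative additions, not quantities reported by the author.

```python
def predicted_rate(injection_error_rate, static_cross_section):
    """CEU prediction: error rate from fault injection times static cross-section."""
    return injection_error_rate * static_cross_section


def compare_with_interval(predicted, lower, upper):
    """Say where a prediction falls with respect to a measured interval."""
    if predicted < lower:
        return "underestimation"
    if predicted > upper:
        return "overestimation"
    return "within interval"


tau_seu = predicted_rate(0.65, 4.11e-8)  # cm^2/device
verdict = compare_with_interval(tau_seu, 0.22e-8, 0.73e-8)
print(f"tau_SEU = {tau_seu:.2e} cm^2/dev: {verdict}, "
      f"{tau_seu / 0.73e-8:.1f}x the upper bound")
```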


## Static Radiation Test MPPA

#### Table VIII. Static test results

| Detected Error | SEE Type | Occurrences | Bit-flip Cells |
|----------------|----------|-------------|----------------|
| SECC           | SEU      | 1949        | 1949           |
| SECC           | MCU (2)  | 322         | 644            |
| SECC           | MCU (3)  | 24          | 72             |
| SECC           | MCU (4)  | 8           | 32             |
| SECC           | MCU (5)  | 2           | 10             |
| SECC           | MCU (6)  | 1           | 6              |
| SECC           | MCU (7)  | 1           | 7              |
| Other error    | SEFI     | 1           | 1              |
|                | Total    | 2308        | 2721           |



Fig. XI: Distribution of the neutron particles perturbing the SMEMs of the clusters.


$$\sigma_{STATIC} = \frac{\text{Number of upsets}}{\text{Fluence}} = \frac{2721}{8.64 \times 10^{8}} = 3.15 \times 10^{-6} \ \frac{cm^2}{dev}$$

Since the tested memory area of the many-core processor represents $2.6 \times 10^{8}$ bits, the static cross-section per bit of the SMEMs is about $1.21 \times 10^{-14}$ cm<sup>2</sup>/bit.

Assuming that the technology of the memory cells is similar for the different memory zones of the device, the cross-section of the GPRs and SFRs can be extrapolated from the cross-section per bit. Taking into account that there are [64 (GPRs) + 50 (SFRs)] x 32 (bits) x [256 (PE) + 32 (RM)] = 1050624 register bits, the registers' sensitivity for the device can be expressed as:

$$\sigma_{\text{STATIC REG}} = 1050624 \ \text{bit} \times 1.21 \times 10^{-14} \frac{cm^2}{bit} = 12.71 \times 10^{-9} \ \frac{cm^2}{dev}$$
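The extrapolation chain can be followed step by step in code. The sketch below uses the figures given above and hypothetical function names; because it does not round the per-bit cross-section to 1.21×10<sup>-14</sup> before multiplying, its last digits differ slightly from the slide value.

```python
def per_bit_cross_section(upset_cells, fluence, tested_bits):
    """Static cross-section per bit (cm^2/bit) of the tested memory."""
    return upset_cells / fluence / tested_bits


def register_bits(gprs=64, sfrs=50, width=32, cores=256 + 32):
    """Register bits considered: (GPRs + SFRs) x width x (PE + RM cores)."""
    return (gprs + sfrs) * width * cores


sigma_bit = per_bit_cross_section(upset_cells=2721, fluence=8.64e8,
                                  tested_bits=2.6e8)
sigma_registers = register_bits() * sigma_bit  # cm^2/device

print(f"register bits      = {register_bits()}")
print(f"sigma per bit      = {sigma_bit:.2e} cm^2/bit")
print(f"sigma of registers = {sigma_registers:.2e} cm^2/dev")
```

The result rests entirely on the stated assumption that register cells and SMEM cells share the same per-bit sensitivity.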

## **Dynamic Radiation test MPPA**

Table IX. Dynamic test results cache disabled

Table X. Dynamic test results cache enabled

| Detected Error           | SEE Type | Occurrences | Consequences      |
|--------------------------|----------|-------------|-------------------|
| SECC                     | SEU      | 676         | None              |
| Data cache parity        | SEU      | 36          | None              |
| Inst. cache parity       | SEU      | 6           | None              |
| Register Trap            | SEFI     | 1           | Hang              |
| Memory comparison failed | SEU      | 2           | App. result error |
| Total                    |          | 721         |                   |





Figure XII. Confidence intervals of the application cross-section cache enabled vs disabled



## **Error-rate Prediction MPPA**

• Dynamic cross-section from application errors

$$\sigma_{DYNAMIC} = \frac{5}{8.64 \times 10^{8}} = 5.78 \times 10^{-9} \frac{cm^2}{dev}$$

For a 95% confidence interval:

$$1.87 \times 10^{-9} \frac{cm^2}{dev} < \sigma_{DYNAMIC} < 13.50 \times 10^{-9} \frac{cm^2}{dev}$$

• Application error-rate from fault injection and static cross-section

$$\tau_{SEU} = \tau_{inj} \times \sigma_{STATIC}$$

$$\tau_{SEU} = 0.44 \times 12.71 \times 10^{-9} = 5.59 \times 10^{-9} \frac{cm^2}{dev}$$

Comparing the predicted and the measured error rate, it can be seen that the CEU approach gives a good approximation since the relative error is about 3.4%. The underestimation of the predicted error-rate can be explained since not all SFR registers were targeted in the fault-injection campaign.
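As a worked check of the relative error, taking the measured dynamic cross-section as the reference and using the unrounded values $5/(8.64 \times 10^{8}) = 5.787 \times 10^{-9}$ and $0.44 \times 12.71 \times 10^{-9} = 5.592 \times 10^{-9}$:

$$\frac{\sigma_{DYNAMIC} - \tau_{SEU}}{\sigma_{DYNAMIC}} = \frac{5.787 \times 10^{-9} - 5.592 \times 10^{-9}}{5.787 \times 10^{-9}} \approx 3.4\%$$

The predicted value also lies well inside the 95% confidence interval of the measured cross-section.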


## CONCLUSIONS AND FUTURE WORK

- ✓ Fault injection campaigns were performed on the program variables and accessible registers of 40x40 and 80x80 matrix multiplications, with caches enabled and disabled. Results show that the input matrices are two times more sensitive to SEUs than the output matrix.
- Results show that fault injection is very useful for analyzing the behavior of an application in the presence of SEU-type events. It makes it possible to modify the program code according to the obtained results, gaining reliability by reducing the impact of faults on the application results.
- Radiation tests were performed on the P2041 platform to evaluate the sensitivity to 14 MeV neutrons of a 45 nm SOI P2041 multi-core processor. The static test results show that 45 nm SOI technology is between 3 and 5 times less sensitive to neutron radiation than its CMOS counterpart.
- ✓ Dynamic tests demonstrated that, in spite of the parity and ECC protection mechanisms, errors occurred in the application results. A deeper analysis determined that these errors were caused by MBUs in the *address tags* and data array.
- ✓ The CEU approach developed at TIMA for fault injection and error-rate prediction has been adapted for the first time to a multi/many-core processor, benefiting from the multiplicity of cores. Results show an overestimation of the predicted error rate, since the device implements ECC and parity in its cache memories.

## CONCLUSIONS AND FUTURE WORK

- This work presents the 14 MeV neutron cross-section of the static memories of the MPPA-256 many-core processor, built in 28 nm CMOS technology, and the evaluation of the device's dynamic response.
- Dynamic tests demonstrate that enabling the cache memories can double the performance of the device without a considerable reliability penalty, since the cache memories implement effective parity protection and their area represents only 1/128 of the whole memory of the compute clusters.
- These results support the conclusion of the authors of [11], who demonstrated that enabling the L1 cache can improve the overall reliability of an embedded processor, since the larger exposed area may be compensated by the shorter exposure time.
- Despite the significant increase in sensitive zones and device complexity, this work has demonstrated the efficiency of the CEU approach for predicting the SEU error rate in processor-based architectures.
- In future work, a similar approach complemented with fault tolerance techniques will be applied to evaluate the many-core processor running an operating system.



