# Toward Real-time Fault-tolerance Through-Silicon-Via based 3D Network-on-Chips

#### Khanh N. Dang, Ph.D.

khanh.n.dang@ieee.org

VNU Key Laboratory for Smart Integrated Systems (SISLAB),

VNU University of Engineering and Technology (VNU-UET),

Vietnam National University, Hanoi (VNU)



The 2nd IEEE SEACAS Workshop Nov. 25–27, 2018 Bandung, Indonesia

#### Content

> Overview

> Project objectives

> Brief results & Discussion

> Conclusion

#### **Overview**

As we enter the multi/many-core era, the number of cores inside a chip is expected to keep increasing.



However, we observe several challenges:

- Parallelism
- Power limitation

# Shifting to unconventional interconnection

- Conventional bus systems cannot scale in the new multi/many-core era
  - A new interconnect architecture is needed
- There has been a strong shift recently:
  - AHB (single channel)  $\rightarrow$  AXI (multi-channel)
  - Intel adopted a ring interconnect for its new chips
  - AMD introduced the new Infinity Fabric
  - Future: scalable Network-on-Chip?
- Last level cache
  - Shared
  - Distributed



# **Emerging Interconnect Materials**

- RF/Wireless: Replace on-chip wires with integrated on-chip antennas that communicate via electromagnetic waves, in free space or a guided medium.
- Carbon Nanotube: Use carbon-based interconnects to replace the Cu/low-k technology.
- Photonic: Use photons instead of electrons to transfer data.
- 3D Integration: Stack multiple layers to obtain smaller footprints and shorter intra-layer interconnects.

### Toward the 3D structure



To keep up with the increase in integration density, moving to the third dimension could be a promising solution.

The near-future enabling technology is the TSV (Through-Silicon Via).

#### 3D Network-on-Chips using TSVs




#### Problems

- Thermal issue:
  - Heat dissipation in 3D-ICs is problematic
- Area:
  - Currently, TSVs are still large (1.4–10 μm) [3]
- Reliability issue:
  - A TSV is a defect-prone device: misalignment, voids, short-to-substrate
  - Because of the thermal issue, the fault rate accelerates exponentially
  - Mechanical stress can also cause cracks or bending (the temperature difference between layers can reach 10°C [7])
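The exponential dependence of fault rate on temperature is commonly described with the Arrhenius acceleration factor. The sketch below assumes that model; the function name, the 0.7 eV activation energy, and the 70°C reference temperature are illustrative assumptions, not values from this work.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_ref_c, t_hot_c, activation_energy_ev=0.7):
    """Ratio of the fault rate at t_hot_c to the fault rate at t_ref_c.

    Uses the Arrhenius model. activation_energy_ev is an assumed,
    mechanism-dependent value chosen only for illustration.
    """
    t_ref_k = t_ref_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_ref_k - 1.0 / t_hot_k
    )
    return math.exp(exponent)


# A layer running 10 degrees hotter than an assumed 70 C reference layer.
factor = arrhenius_acceleration(70.0, 80.0)
print(f"fault-rate acceleration: {factor:.2f}x")
```

Under these assumed numbers, a 10°C difference between layers is already enough to change the fault rate noticeably, which is why the hotter layers deserve more frequent testing.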

# **Project objective**

Design a 3D-NoC system with:

- Fault tolerance: provide methods to detect, localize, and recover from faults
- Real-time awareness: respond to a new fault within a dedicated "deadline":
  - Can detect faults during operation
  - Provide a sufficient solution to handle them
- Thermal awareness: predict and adapt to potential reliability issues caused by heat:
  - Predict potential faults at hotspots
  - Provide a backup solution

#### Fault-tolerant phases

- **Fault Detection**
  - Let the system know that new faults have occurred
- **Fault Localization**
  - Find the location of the fault
- **Fault Recovery**
  - Recover the system from the faults (e.g., spares, re-execution)

#### Equivalent electrical model of Cu-Cu interconnect.



[3] Jani et al. "BISTs for Post-Bond Test and Electrical Analysis of High Density 3D Interconnect defects" 23rd IEEE European Test Symposium

### Delay caused by misalignment and voids



[3] Jani et al. "BISTs for Post-Bond Test and Electrical Analysis of High Density 3D Interconnect defects" 23rd IEEE European Test Symposium

#### Soft errors

- Transient faults (soft errors):
  - Since the top layers act as shields, they can reduce the impact of cosmic rays
  - Smaller transistors may reduce the per-bit error rate
  - However, increasing density raises the per-chip error rate
- Crosstalk:
  - TSVs are usually placed in parallel, which makes them heavily affected by crosstalk

# Wear out defects

- Manufacturing defects should be tested for and repaired.
- However, new defects can occur during operation:
  - Time-dependent gate oxide breakdown
  - Negative-bias temperature instability, which affects latency
  - Electromigration
  - Mechanical stress, which might crack TSVs



# Real-time awareness: Response time to new fault

- Besides high fault coverage, a short response time is also critical
  - Leaving the system at risk is undesirable
  - If checkpointing is used, a shorter response time lowers the recovery cost
- Methods:
  - Off-line
  - On-line
- On-line:
  - Periodically scheduled
  - Interleaving test
  - On-communication/computation

### Periodically scheduled test

- State-of-the-art online testing for NoCs:
  1. Pre-schedule the test for a specific device (e.g., a router)
  2. Once the time is suitable:
     1. Detach the device
     2. Reroute the NoC
     3. Test the device
     4. After the test, re-attach the device
- What is the major problem?
  - For real-time applications, each task (communication or computation) has a specific deadline.
  - Invoking a test without considering these deadlines may cause system errors.
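The detach, reroute, test, re-attach sequence and its deadline problem can be sketched as follows. `Task`, `detour_cycles`, and all cycle counts are hypothetical names and numbers chosen for illustration, and the model (rerouting adds a fixed delay to every running task) is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    remaining_cycles: int  # work still to be done
    deadline: int          # absolute cycle by which the task must finish


def misses_deadline(now, task, detour_cycles):
    """True if the extra rerouting delay pushes the task past its deadline."""
    return now + task.remaining_cycles + detour_cycles > task.deadline


def run_periodic_test(now, tasks, detour_cycles):
    """Run the scheduled test blindly and report which tasks it breaks."""
    log = ["detach router", "reroute NoC", "test router", "re-attach router"]
    missed = [t.name for t in tasks if misses_deadline(now, t, detour_cycles)]
    return log, missed


tasks = [Task("video", 80, 100), Task("logging", 20, 200)]
log, missed = run_periodic_test(now=0, tasks=tasks, detour_cycles=30)
print(log)
print("deadline misses:", missed)
```

A deadline-aware scheduler would call `misses_deadline` before detaching the device and postpone the test when any task would be broken.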

# Interleaving test

- Test whenever the link is free
  - For instance: when no flit is routed to the vertical connection, a test pattern is sent.
- Advantages:
  - Minimizes performance degradation
  - Transactions have the highest priority
- Disadvantages:
  - If the utilization rate is high  $\rightarrow$  fewer chances to test
  - If the utilization rate is high  $\rightarrow$  higher power consumption  $\rightarrow$  higher temperature  $\rightarrow$  higher fault rates
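A minimal simulation of the interleaving idea shows why high utilization starves the test. Traffic is modeled as independent random arrivals per cycle, which is an assumption made only for this sketch; `interleaved_test_slots` is a hypothetical helper name.

```python
import random


def interleaved_test_slots(utilization, cycles=10_000, seed=0):
    """Count idle cycles on a vertical link that could carry a test pattern.

    utilization is the assumed probability that a flit wants the link
    in a given cycle.
    """
    rng = random.Random(seed)
    test_slots = 0
    for _ in range(cycles):
        flit_waiting = rng.random() < utilization
        if not flit_waiting:
            # The link is free, so a test pattern can be sent.
            test_slots += 1
        # Otherwise the flit wins: transactions have the highest priority.
    return test_slots


for u in (0.1, 0.5, 0.9):
    print(f"utilization {u:.1f}: {interleaved_test_slots(u)} test slots")
```

The number of test slots shrinks as utilization grows, which is exactly when the link runs hotter and testing matters most.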

### On-communication/computation (OCT)



Deferred test: let the system keep running at risk so that the real-time constraint is met, then test the quality after a deferred time  $\Delta_D$.

# Fault recovery for TSV

- Spare (redundant) TSVs for recovery:
  - Replace the faulty TSV with a spare (healthy) one.
  - The number of spares must be chosen carefully.
- Algorithmic approach:
  - Use alternative communication paths.
  - Remap the system to avoid faulty paths.
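Spare-based replacement can be illustrated with a generic shift-style remapping, in which each logical signal moves to the next healthy physical TSV. This is a textbook redundancy sketch under assumed names (`remap_signals`, index-based TSV numbering); it is not the TSV sharing algorithm of [1].

```python
def remap_signals(num_signals, num_spares, faulty):
    """Map each logical signal to a healthy physical TSV, skipping faulty ones.

    Physical TSVs are indexed 0 .. num_signals + num_spares - 1.
    Returns a list whose entry i is the physical TSV carrying signal i,
    or None when there are more faults than spares.
    """
    healthy = [t for t in range(num_signals + num_spares) if t not in faulty]
    if len(healthy) < num_signals:
        return None  # not enough healthy TSVs: fall back to rerouting
    return healthy[:num_signals]


print(remap_signals(4, 1, {2}))     # one fault, one spare
print(remap_signals(4, 1, {1, 2}))  # two faults, one spare
```

When the function returns `None`, the group cannot be repaired locally and the algorithmic options (alternative paths, remapping) take over.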

#### **Brief results & Discussion**

# TSV sharing algorithm



[1] Khanh et al. "Scalable design methodology and online algorithm for TSV-cluster defects recovery in highly reliable 3D-NoC systems", IEEE Transactions on Emerging Topics in Computing (TETC) (in-press)

# **Reliability Evaluation**



(a) Layer size: 2 × 2 (4 routers, 16 TSV clusters); (b) Layer size: 4 × 4 (16 routers, 64 TSV clusters); (c) Layer size: 8 × 8 (64 routers, 256 TSV clusters); (d) Layer size: 16 × 16 (256 routers, 1024 TSV clusters); (e) Layer size: 32 × 32 (1024 routers, 4096 TSV clusters); (f) Layer size: 64 × 64 (4096 routers, 16384 TSV clusters).

# High coding rate ECC [4]

- We use Parity Product Code (square code):
  - Parity for flit
  - Parity for packet
- The system can easily correct 1 fault
- 2+ faults:
  - Retransmit the flit
  - Retransmit the bit index
- Lower rates:
  - Parity for multiple packets
  - May need to roll back if there is a fault
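The single-fault correction of a parity product code can be sketched as below: one parity bit per flit (row) and one per bit position across the packet (column), with the faulty bit found at the intersection of the failing row and column. The function names and the assumption that parity bits arrive intact are choices of this sketch, not details taken from [4].

```python
def encode(packet):
    """Return (row_parity, col_parity) for a packet.

    packet is a list of equal-width flits, each a list of 0/1 bits.
    """
    row_parity = [sum(flit) % 2 for flit in packet]
    col_parity = [sum(col) % 2 for col in zip(*packet)]
    return row_parity, col_parity


def check_and_correct(packet, row_parity, col_parity):
    """Correct a single flipped data bit in place.

    Returns "ok", "corrected", or "retransmit". Assumes the parity bits
    themselves are intact. Double errors are detected; some patterns of
    three or more errors can be miscorrected or missed.
    """
    new_row, new_col = encode(packet)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_row, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_col, col_parity)) if a != b]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        packet[bad_rows[0]][bad_cols[0]] ^= 1  # flip the faulty bit back
        return "corrected"
    return "retransmit"


original = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
rows, cols = encode(original)
received = [flit[:] for flit in original]
received[1][2] ^= 1  # inject a single-bit soft error
print(check_and_correct(received, rows, cols), received == original)
```

With two or more faulty bits the row and column checks no longer point at a single position, so the receiver falls back to retransmission as listed above.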



# Parity for multiple packets [4]



We use a parity-for-multiple-packets technique named OPC (Overflow Packet Check), which is a deferred test technique.

[4] Khanh N. Dang and Xuan-Tu Tran, "Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication", 2018 IEEE 11th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Sep. 12-14, 2018

# OCT by utilizing spare TSV

To perform OCT, we use a spare TSV to keep the connection alive while testing.

| M | R | K  | # hidden errors | detection rate |
|---|---|----|-----------------|----------------|
| 5 | 1 | 8  | 352             | 0.9648         |
| 5 | 1 | 16 | 2               | 0.9998         |
| 5 | 1 | 32 | 0               | 1.00           |
| 5 | 2 | 8  | 1828            | 0.8172         |
| 5 | 2 | 16 | 4               | 0.9996         |
| 5 | 2 | 32 | 0               | 1.00           |
| 9 | 1 | 8  | 679             | 0.9321         |
| 9 | 1 | 16 | 3               | 0.9998         |
| 9 | 1 | 32 | 0               | 1.00           |
| 9 | 2 | 8  | 3921            | 0.9321         |
| 9 | 2 | 16 | 16              | 0.9998         |
| 9 | 2 | 32 | 0               | 1.00           |

This work is in preparation.



(a) TSV group with two faults

(b) Isolating and shifting: still faulty



(c) Isolating and shifting: faults isolated

## Conclusion

- We have been working on fault-tolerant design for 3D-NoCs with:
  - Real-time awareness with OCT
  - Cluster defect tolerance with the TSV sharing algorithm
  - Adaptive soft error protection with deferred OCT
- In the future, we aim to provide a comprehensive solution for TSVs that considers:
  - Real-time constraints
  - Thermal issues
  - Mixed types of faults: soft errors, crosstalk, permanent faults
  - Dynamic frequency/sampling

### References

- [1] Khanh et al. "Scalable design methodology and online algorithm for TSV-cluster defects recovery in highly reliable 3D-NoC systems", IEEE Trans. on Emerging Topics in Computing (TETC) (in-press)
- [2] Khanh N. Dang and Abderazek Ben Abdallah, "Architecture and Design Methodology for Highly-Reliable TSV-NoC Systems", Invited Book Chapter, Horizons in Computer Science Research. Volume 16, Chapter 7. Nova Science Publishers, 2018.
- [3] Jani et al. "BISTs for Post-Bond Test and Electrical Analysis of High Density 3D Interconnect defects" 23rd IEEE European Test Symposium
- [4] Khanh N. Dang and Xuan-Tu Tran, "Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication", IEEE 11th Int. Symp. on Emb. Multicore/Many-core SoCs, Sep. 12-14, 2018
- [5] J. Wang et al., "Efficient design-for-test approach for networks-on-chip," IEEE Trans. Comput., 2018.
- [6] L. Huang et al., "Non-blocking testing for network-on-chip," IEEE Trans. Comput., vol. 65, no. 3, pp. 679–692, 2016.
- [7] Y. J. Park et al., "Thermal analysis for 3D multi-core processors with dynamic frequency scaling," in 2010 IEEE/ACIS 9th Int. Conf. on Comput. and Inform. Sci. (ICIS). IEEE, 2010, pp. 69–74.

### Thank you for your attention!