**Operating Systems and C**

Fall 2022, Performance-Track

6. Locality

# Parallel Tracks

## perflab:

- have two matrix multiplication procedures (rotate, smooth), have to rewrite them.
- about optimization techniques; blocking, loop unrolling, etc.

**performance-track lecture nr. 1:** *what you'll need for perflab.* array layout, what it means for performance, cache hierarchy, how associativity is organized in the cache. (important stuff for anyone)

**performance-track lecture nr. 2:** *what you'll need for perflab.* optimizations. how to write code so the compiler can derive performant code. manual transformations, blocking, loop unrolling.

## attacklab:

- you have an executable, have to attack it.
- about the stack; code injection (smashing the stack), return-oriented programming (find interesting code in other programs).

**security-track lecture nr. 2:** Linux culture.



- Locality
- Memory Hierarchy
- Cache Utilization
- A note on Security

http://jimgray.azurewebsites.net/jimgraytalks.htm





# Jim Gray, 2006

http://jimgray.azurewebsites.net/jimgraytalks.htm

layout of data in memory

#### **RAM Locality is King**

- The cpu mostly waits for RAM
- Flash / Disk are 100,000 ...1,000,000 clocks away from cpu
- RAM is ~100 clocks away unless you have locality (cache).
- If you want 1CPI (clock per instruction) you have to have the data in cache (program cache is "easy")
- This requires cache-conscious data structures and algorithms: sequential (or predictable) access patterns
- Main Memory DB is going to be common.



10 years

#### Data Systems Group Promotion



microarchitectural analysis,
benchmarking, ...
how resources are used.
(how much time CPU spends waiting for memory?)



Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently.

Temporal locality:

- Recently referenced items are likely to be referenced again in the near future.

Spatial locality:

- Items with nearby addresses tend to be referenced close together in time.





## Locality Example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

Data references

- Reference array elements in succession (stride-1 reference pattern): spatial locality.
- Reference variable `sum` each iteration: temporal locality.

Instruction references

- Reference instructions in sequence: spatial locality.
- Cycle through loop repeatedly: temporal locality.

#### Qualitative Estimates of Locality

Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

Question: Does this function have good locality with respect to array a?

```
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

```
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

IT UNIVERSITY OF COPENHAGEN

Q: which is faster?
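One way to explore the question is to time both traversals on the same array. The sketch below is only illustrative: the array size, the fill pattern, and the helper names `fill_grid` and `time_sum` are assumptions and not part of the slides, and `clock()` gives only a coarse measurement.

```c
#include <assert.h>
#include <time.h>

#define M 1024   /* rows: illustrative size, not from the slides */
#define N 1024   /* columns */

static int grid[M][N];   /* static storage, so it does not live on the stack */

/* Row-wise traversal: the inner loop walks along one row (stride-1). */
long sum_array_rows(int a[M][N])
{
    long sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: the inner loop jumps N ints between accesses. */
long sum_array_cols(int a[M][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Fill the global array with small values so both sums can be compared. */
void fill_grid(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = (i + j) % 7;
}

/* Run f once on the global array; store its sum and return CPU seconds used. */
double time_sum(long (*f)(int a[M][N]), long *result)
{
    assert(f != NULL && result != NULL);
    clock_t start = clock();
    *result = f(grid);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `fill_grid()` and then `time_sum` with each function gives two timings for the same result. On most machines the row-wise version should come out faster, but the gap depends on the cache sizes and on the compiler's optimization level.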

# Memory Hierarchies

- Some fundamental and enduring properties of hardware and software:
  - Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
  - The gap between CPU and main memory *speed* is widening.
  - Well-written programs tend to exhibit good locality.
- These fundamental properties complement each other beautifully.
- They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

http://jimgray.azurewebsites.net/jimgraytalks.htm

# Jim Gray, 2006



# I/O Bus



A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices.

# Reading a Disk Sector (1)



### Reading a Disk Sector (2)





#### Solid-State Drives



# **Open-Channel SSDs: Design Space**



LightNVM separates (application-customisable) front-end SSD management from (media-specific) back-end SSD management.

Mathias Bjørling, PhD at ITU,

# Programming the Storage Controller

#### Put Everything in Future (Disk) Controllers (it's not "if", it's "when?")

Jim Gray

http://www.research.Microsoft.com/~Gray

Acknowledgements:

- Dave Patterson explained this to me a year ago
- Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments



#### **Basic Argument for x-Disks**

- Future disk controller is a super-computer.
  - 1 bips processor
  - 128 MB dram
  - 100 GB disk plus one arm
- Connects to SAN via high-level protocols
  - RPC, HTTP, DCOM, Kerberos, Directory Services, ...
  - Commands are RPCs
  - management, security, ...
  - Services file/web/db/... requests
- Move apps to disk to save data movement
  - need programming environment in controller
Niclas works (worked?) on this!

Jim Gray, NASD Talk, 6/8/98

http://jimgray.azurewebsites.net/jimgraytalks.htm

#### Data Systems Group Promotion

#### **Computational Storage**

By offloading processing to storage, we can deal efficiently with very large volumes of stored data. We work with prototypes composed of Open-Channel SSDs and a programmable storage controller (i.e., a Linux-based ARM processor) integrated into a network switch. Topics for thesis include (1) key-value store on the storage controller, (2) evaluation of 100GE RPC, and (3) application-specific FTL.

#### **FPGA-based Hardware Acceleration**

Field Programmable Gate Arrays are now an integral part of public cloud infrastructures. You can for example run customized FPGA instances on AWS. A project focuses on FPGA-based hardware acceleration at the level of an SSD Flash Translation Layer, at the level of a Database storage manager or at the level of the database client. You will be able to experiment with FPGAs in the lab and on AWS.



#### PHILIPPE BONNET Professor

### Memory Read Transaction (1)

CPU places address A on the memory bus.



### Memory Read Transaction (2)

Main memory reads A from the memory bus, retrieves word x, and places it on the bus.



### Memory Read Transaction (3)

CPU reads word x from the bus and copies it into register %eax.



### Memory Write Transaction (1)

CPU places address A on bus. Main memory reads it and waits for the corresponding data word to arrive.



### Memory Write Transaction (2)

CPU places data word y on the bus. Main memory reads data word y from bus and stores it at address A.



RAM is organized as an array of "supercells".

d × w DRAM: d·w total bits organized as d supercells of size w bits



#### An Example Memory Hierarchy

this is actually a myth; local network faster than some local disks.



#### Caches

car mechanic analogy

- *Cache:* A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
- Fundamental idea of a memory hierarchy:
  - For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
- Why do memory hierarchies work? Because of **locality**.
  - Programs tend to access the data at level k more often than they access the data at level k+1.
  - Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
- *Big Idea:* Memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but serves data to programs at the rate of the fast storage near the top.



#### General Cache Concepts: Hit



#### **General Cache Concepts: Miss**



# General Caching Concepts: Types of Cache Misses

#### Cold (compulsory) miss

Cold misses occur because the cache is empty.

#### Conflict miss

Most caches limit lines at level k+1 to a small subset (sometimes a singleton) of the line positions at level k.

- E.g. Line i at level k+1 must be placed in line (i mod 4) at level k.

Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k line.

- E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.

#### Capacity miss

Occurs when the set of active cache lines (working set) is larger than the cache.

placement policy: chosen to avoid conflict misses. replacement policy: takes care to avoid capacity misses.

thrashing = every access misses.
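The conflict-miss example above can be replayed in a few lines of C. This is a minimal sketch of the "(i mod 4)" placement rule only; the function name `count_misses` and the use of -1 as an "empty" marker are assumptions made for illustration.

```c
#include <assert.h>

#define NUM_LINES 4   /* level-k cache with 4 line positions, as in "i mod 4" */

/* Count misses for a trace of (non-negative) block numbers in a cache where
 * block i may only be placed in line (i mod NUM_LINES). */
int count_misses(const int *blocks, int n)
{
    int line[NUM_LINES];
    int misses = 0;

    assert(blocks != 0 && n >= 0);
    for (int i = 0; i < NUM_LINES; i++)
        line[i] = -1;                    /* -1 marks an empty line (cold cache) */

    for (int i = 0; i < n; i++) {
        int slot = blocks[i] % NUM_LINES;
        if (line[slot] != blocks[i]) {
            misses++;                    /* cold miss or conflict miss */
            line[slot] = blocks[i];      /* replace whatever was there */
        }
    }
    return misses;
}
```

With the trace 0, 8, 0, 8, 0, 8 every reference misses, because blocks 0 and 8 both map to line 0 and keep evicting each other while the other three lines stay unused. A trace such as 0, 1, 0, 1, 0, 1 only suffers the two cold misses.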

### Examples of Caching in the Hierarchy

ex: access to memory is 100x more expensive than L1 cache.

| Cache Type           | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By       |
|----------------------|----------------------|---------------------|------------------|------------------|
| Registers            | 4-8 byte words       | CPU core            | 0                | Compiler         |
| TLB                  | Address translations | On-Chip TLB         | 0                | Hardware         |
| L1 cache             | 64-byte line         | On-Chip L1          | 1                | Hardware         |
| L2 cache             | 64-byte line         | On/Off-Chip L2      | 10               | Hardware         |
| Virtual Memory       | 4-KB page            | Main memory         | 100              | Hardware + OS    |
| Buffer cache         | Parts of files       | Main memory         | 100              | OS               |
| Disk cache           | Disk sectors         | Disk controller     | 100,000          | Disk firmware    |
| Network buffer cache | Parts of files       | Local disk          | 10,000,000       | AFS/NFS client   |
| Browser cache        | Web pages            | Local disk          | 10,000,000       | Web browser      |
| Web cache            | Web pages            | Remote server disks | 1,000,000,000    | Web proxy server |

#### Cache Memories


given some data, where is it going to be located?

10s

Cache memories are small, fast SRAM-based memories managed automatically in hardware.

- Hold frequently accessed blocks of main memory

CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.

Typical system structure:



#### General Cache Organization (S, E, B)

organization of a cache: S sets, E lines per set, B bytes per line (block).



#### Cache Read



- Locate set

#### **Direct-Mapped Cache Simulation**

M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 blocks/set

we ask for 1 byte, but we transfer 1 line at a time. here, a line is 2 bytes.
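The simulation parameters (M=16, B=2, S=4, E=1) translate into a 4-bit address with one offset bit, two set-index bits, and one tag bit. The sketch below replays a read trace on such a cache. The struct and function names are assumptions, and since the trace from the original figure did not survive, the trace in the usage note is an illustrative choice.

```c
#include <assert.h>

#define B 2   /* bytes per block */
#define S 4   /* number of sets, one line each (E = 1) */

struct line {
    int valid;      /* 0 until the line has been filled */
    unsigned tag;   /* address bits above the set index */
};

/* Replay n one-byte reads on a cold cache. outcome[i] becomes 1 for a hit
 * and 0 for a miss; the return value is the total number of hits. */
int simulate_reads(const unsigned *addrs, int n, int *outcome)
{
    struct line cache[S] = {{0, 0}};
    int hits = 0;

    assert(addrs != 0 && outcome != 0 && n >= 0);
    for (int i = 0; i < n; i++) {
        unsigned a = addrs[i] & 0xFu;   /* M = 16: keep the low 4 address bits */
        unsigned set = (a / B) % S;     /* middle two bits select the set */
        unsigned tag = a / (B * S);     /* top bit is the tag */

        if (cache[set].valid && cache[set].tag == tag) {
            outcome[i] = 1;
            hits++;
        } else {
            outcome[i] = 0;             /* miss: fetch the whole 2-byte block */
            cache[set].valid = 1;
            cache[set].tag = tag;
        }
    }
    return hits;
}
```

For the trace 0, 1, 7, 8, 0 the outcomes are miss, hit, miss, miss, miss. Address 1 hits because it shares a block with address 0, and the final read of 0 misses because address 8 mapped to the same set and replaced it.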

# Why index with the middle bits?

4-set cache (sets 00, 01, 10, 11). every cell on the right is 1 byte. line size 2 bytes.

[Figure: the same memory mapped onto the 4-set cache with high-order, middle-order, and low-order bit indexing of the set index bits]

notes on the figure: "50% miss"; "50% miss, underutilized"; "better temporal cache"

**Q:** which indexing strategy is best?

hint: sequential access
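To experiment with the hint, the sketch below computes the set index for 4-bit addresses under two of the strategies and counts how many distinct sets a sequential scan touches. The function names and the choice of "top two bits" for high-order indexing are assumptions made for this illustration.

```c
#include <assert.h>

#define LINE_BYTES 2   /* line size from the slide */
#define NUM_SETS   4   /* 4-set cache */
#define ADDR_BITS  4   /* 16 one-byte addresses */

/* Middle-order indexing: set index is the two bits just above the offset bit. */
unsigned middle_index(unsigned addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* High-order indexing: set index is the top two address bits. */
unsigned high_index(unsigned addr)
{
    return (addr >> (ADDR_BITS - 2)) % NUM_SETS;
}

/* Number of distinct sets touched by the addresses start .. start+len-1. */
int sets_touched(unsigned (*index)(unsigned), unsigned start, unsigned len)
{
    int seen[NUM_SETS] = {0};
    int count = 0;

    assert(index != 0);
    for (unsigned a = start; a < start + len; a++) {
        unsigned s = index(a & 0xFu);
        if (!seen[s]) {
            seen[s] = 1;
            count++;
        }
    }
    return count;
}
```

Scanning addresses 0 to 7 (8 bytes, exactly the capacity of this cache) touches all 4 sets with middle-order indexing but only 2 sets with high-order indexing, so under high-order indexing neighbouring lines compete for the same set while half of the cache sits idle.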

compiler is going to reserve space for the array.

## Basic Principle

- T A[L];
- A is an Array of data type T and length L
- Contiguously allocated region of  $L * \mathtt{sizeof}(T)$  bytes



## Array Allocation




## Array Access

`int A[5] = {0, 1, 2, 3, 4};`

- Array of data type *int* and length 5
- Identifier **A** can be used as a pointer to array element 0: type `int *`

val from previous slide

| Reference  | Type    | Value    |
|------------|---------|----------|
| `val[4]`   | `int`   | 4        |
| `val`      | `int *` | x        |
| `val+1`    | `int *` | x + 4    |
| `&val[2]`  | `int *` | x + 8    |
| `val[5]`   | `int`   | ??       |
| `*(val+1)` | `int`   | 1        |
| `val + i`  | `int *` | x + 4 i  |
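The rows of the table can be checked directly in C. The sketch below assumes the array holds {0, 1, 2, 3, 4} as in the declaration above; the helper `byte_offset` is a hypothetical name. The table's "x + 4" corresponds to a 4-byte `int`, so the code expresses offsets in terms of `sizeof(int)`. `val[5]` is deliberately not read, since it lies outside the array.

```c
#include <assert.h>

static int val[5] = {0, 1, 2, 3, 4};

/* Distance in bytes from the start of val (address x) to pointer p. */
long byte_offset(const int *p)
{
    assert(p >= val && p <= val + 5);   /* only pointers into val are meaningful */
    return (long)((const char *)p - (const char *)val);
}

/* Value reached through pointer arithmetic: *(val + i) is the same as val[i]. */
int element_via_pointer(int i)
{
    assert(i >= 0 && i < 5);
    return *(val + i);
}
```

Here `byte_offset(val + 1)` equals `sizeof(int)` and `byte_offset(&val[2])` equals `2 * sizeof(int)`, matching "x + 4" and "x + 8" on a platform with 4-byte `int`.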



## Array Example

20 = 5\*4

- Declaration "zip\_dig cmu" is equivalent to "int cmu[5]"
- Example arrays were allocated in successive 20-byte blocks
  - Not guaranteed to happen in general
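The declarations behind this example did not survive extraction, so the following is a reconstruction under assumptions: `ZLEN`, the typedef, the digit values, and the helper functions are illustrative and may differ from the original slide.

```c
#include <assert.h>
#include <stddef.h>

#define ZLEN 5
typedef int zip_dig[ZLEN];        /* "zip_dig x;" declares "int x[5];" */

zip_dig cmu = {1, 5, 2, 1, 3};    /* digit values are illustrative */

/* Return digit number dig of a zip code. */
int get_digit(zip_dig z, int dig)
{
    assert(dig >= 0 && dig < ZLEN);
    return z[dig];
}

/* Size of one zip_dig in bytes: 5 * sizeof(int), i.e. 20 with a 4-byte int. */
size_t zip_dig_bytes(void)
{
    return sizeof(zip_dig);
}
```

The 20-byte figure follows from the type alone. Whether several `zip_dig` variables end up in successive 20-byte blocks is up to the compiler and linker, which is why the slide says this is not guaranteed.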

## Array Example



## A Higher Level Example

```
/* Row-wise traversal of a 16x16 array of doubles. */
double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}
```

```
/* Column-wise traversal of the same array. */
double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
```



## What about writes?

Multiple copies of data exist:

- L1, L2, Main Memory, Disk

What to do on a write-hit?

- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line) (note: faster)
  - Need a dirty bit (line different from memory or not)

What to do on a write-miss?

- Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location follow
- No-write-allocate (writes immediately to memory)

Typical

- Write-through + No-write-allocate
- Write-back + Write-allocate
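To see how the two typical combinations differ in memory traffic, the sketch below counts writes to memory for a sequence of stores. It is a toy model under strong assumptions: the cache has a single line, each store is identified only by the tag of the line it targets, and the write-back count includes a final flush so the totals are comparable. All names are hypothetical.

```c
#include <assert.h>

struct wline {
    int valid;      /* line holds data */
    int dirty;      /* line differs from memory */
    unsigned tag;   /* which memory line is cached */
};

/* Write-back + write-allocate on a one-line cache: count writes to memory. */
int mem_writes_write_back(const unsigned *tags, int n)
{
    struct wline l = {0, 0, 0};
    int mem_writes = 0;

    assert(tags != 0 && n >= 0);
    for (int i = 0; i < n; i++) {
        if (!(l.valid && l.tag == tags[i])) {   /* write miss */
            if (l.valid && l.dirty)
                mem_writes++;                   /* write the old line back */
            l.valid = 1;                        /* write-allocate: load line */
            l.tag = tags[i];
        }
        l.dirty = 1;                            /* update only the cached copy */
    }
    if (l.valid && l.dirty)
        mem_writes++;                           /* final flush of the dirty line */
    return mem_writes;
}

/* Write-through + no-write-allocate: every store goes straight to memory. */
int mem_writes_write_through(const unsigned *tags, int n)
{
    assert(tags != 0 && n >= 0);
    return n;
}
```

For four stores to line 7 followed by one store to line 3, write-back needs 2 memory writes while write-through needs 5. The saving comes entirely from repeated writes to the same line, which is the case write-allocate is meant for.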

## Intel Core i7 Cache Hierarchy

Processor package



# **Cache Performance Metrics**

- Miss Rate
  - Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
  - Typical numbers (in percentages):
    - 3-10% for L1
    - can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit Time
  - Time to deliver a line in the cache to the processor
    - includes time to determine whether the line is in the cache
  - Typical numbers:
    - 1-2 clock cycles for L1
    - 5-20 clock cycles for L2
- Miss Penalty
  - Additional time required because of a miss
    - typically 50-200 cycles for main memory (Trend: increasing!)

# Latency Numbers Every Programmer Should Know

we've seen this. penalty: from 1 to 100. huge difference.





https://colin-scott.github.io/personal\_website/research/interactive\_latency.html

## Let's think about those numbers

Huge difference between a hit and a miss

- Could be 100x, if just L1 and main memory

#### Would you believe 99% hits is twice as good as 97%?

Consider: cache hit time of 1 cycle, miss penalty of 100 cycles

Average access time:

- 97% hits: 1 cycle + 0.03 \* 100 cycles = **4 cycles**
- 99% hits: 1 cycle + 0.01 \* 100 cycles = **2 cycles**

This is why "miss rate" is used instead of "hit rate"
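The arithmetic above is the usual average-access-time formula, hit time plus miss rate times miss penalty. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Average access time in cycles: hit time + miss rate * miss penalty. */
double average_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

`average_access_time(1, 0.03, 100)` gives about 4 cycles and `average_access_time(1, 0.01, 100)` about 2 cycles. Going from a 3% to a 1% miss rate halves the average, even though the hit rate only moves from 97% to 99%.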

# Arrays and Cache Metrics

C arrays allocated in row-major order

- each row in contiguous memory locations

Stepping through columns in one row:

- `for (i = 0; i < N; i++) sum += a[0][i];`
- accesses successive elements
- if block size (B) > 4 bytes, exploit spatial locality
  - compulsory miss rate = 4 bytes / B

Stepping through rows in one column:

- `for (i = 0; i < n; i++) sum += a[i][0];`
- accesses distant elements
- no spatial locality!
  - compulsory miss rate = 1 (i.e. 100%)
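The two miss rates can be reproduced by counting how often a walk over the array moves into a new block. The sketch assumes a cold cache, an array that starts on a block boundary, and a 64-byte block; `N`, `BLOCK`, and the function name are illustrative choices rather than values from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define N 64       /* illustrative row length (number of ints per row) */
#define BLOCK 64   /* assumed block size in bytes */

/* Fraction of accesses that enter a new block when reading count elements
 * located at byte offsets 0, stride_bytes, 2 * stride_bytes, ... */
double compulsory_miss_rate(size_t stride_bytes, int count)
{
    long prev_block = -1;
    int misses = 0;

    assert(count > 0);
    for (int i = 0; i < count; i++) {
        long block = (long)((size_t)i * stride_bytes / BLOCK);
        if (block != prev_block) {      /* first touch of this block */
            misses++;
            prev_block = block;
        }
    }
    return (double)misses / count;
}
```

Walking along a row uses a stride of `sizeof(int)` and yields `sizeof(int) / BLOCK`, which is the "4 bytes / B" of the slide for a 4-byte `int`. Walking down a column uses a stride of `N * sizeof(int)`, so every access lands in a different block and the rate is 1.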

skip


## Make the common case go fast

- Focus on the inner loops of the core functions

#### Minimize the misses in the inner loops

- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)

Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.

Question: Can you permute the loops so that the function scans the 3-d array a with a stride-1 reference pattern (and thus has good spatial locality)?

    int sum_array_3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    sum += a[i][k][j];
        return sum;
    }
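One possible answer to the question is to order the loops so that the rightmost index, `j`, varies fastest. The sketch below keeps the original function next to the permuted one so they can be compared; the sizes `M` and `N`, the name `sum_array_3d_stride1`, and the helper `fill_3d` are assumptions for illustration.

```c
#include <assert.h>

#define M 4   /* illustrative sizes, not from the slides */
#define N 8

/* Loop order from the question: the inner loop varies k, the middle index,
 * so consecutive accesses are N ints apart. */
int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    assert(a != 0);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[i][k][j];
    return sum;
}

/* Permuted order i, k, j: the inner loop varies j, the rightmost index,
 * so consecutive accesses are adjacent in memory (stride-1). */
int sum_array_3d_stride1(int a[M][N][N])
{
    int i, j, k, sum = 0;
    assert(a != 0);
    for (i = 0; i < M; i++)
        for (k = 0; k < N; k++)
            for (j = 0; j < N; j++)
                sum += a[i][k][j];
    return sum;
}

/* Store each element's linear position (0 .. M*N*N-1) so sums are easy to check. */
void fill_3d(int a[M][N][N])
{
    for (int i = 0; i < M; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                a[i][k][j] = i * N * N + k * N + j;
}
```

Both functions visit the same elements and return the same sum; only the order of visits changes, and with it the stride of the inner loop.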

# Locality Example #2

Assume:

- Line size = 32B (big enough for four 64-bit words)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

Analysis Method:

- Look at access pattern of inner loop



## Matrix Multiplication





## Matrix Multiplication Example

## Description:

- Multiply N x N matrices
- O(N<sup>3</sup>) total operations
- N reads per source element
- N values summed per destination
  - but may be able to hold in register



## Matrix Multiplication (ijk)



Misses per inner loop iteration:

| A    | B   | C   |
|------|-----|-----|
| 0.25 | 1.0 | 0.0 |

## Matrix Multiplication (jik)





Misses per inner loop iteration:

| A    | B   | C   |
|------|-----|-----|
| 0.25 | 1.0 | 0.0 |

## Matrix Multiplication (kij)





Misses per inner loop iteration:

| A   | B    | C    |
|-----|------|------|
| 0.0 | 0.25 | 0.25 |

## Matrix Multiplication (ikj)





Misses per inner loop iteration:

| A   | B    | C    |
|-----|------|------|
| 0.0 | 0.25 | 0.25 |

## Matrix Multiplication (jki)



Misses per inner loop iteration:

| A   | B   | C   |
|-----|-----|-----|
| 1.0 | 0.0 | 1.0 |

## Matrix Multiplication (kji)





Misses per inner loop iteration:

| A   | B   | C   |
|-----|-----|-----|
| 1.0 | 0.0 | 1.0 |

# Summary of Matrix Multiplication

```
/* ijk (& jik): k as inner loop */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

/* kij (& ikj): j as inner loop */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

/* jki (& kji): i as inner loop */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
```

ijk (& jik):

- 2 loads, 0 stores
- misses/iter = 1.25

kij (& ikj):

- 2 loads, 1 store
- misses/iter = 0.5

jki (& kji):

- 2 loads, 1 store
- misses/iter = 2.0
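The loop orders compute the same product and differ only in access pattern. As a sanity check, the sketch below implements the ijk and kij variants as complete functions and compares their results on a small input. The matrix size, the fill values, and the names `matmul_ijk`, `matmul_kij`, and `orders_agree` are assumptions for illustration; a matrix this small fits in cache, so it shows correctness only and says nothing about performance.

```c
#include <assert.h>

#define N 8   /* illustrative size; the analysis above assumes N is very large */

/* ijk: inner loop over k, accumulating into a local sum. */
void matmul_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != 0 && b != 0 && c != 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij: inner loop over j, stride-1 in both b and c. c is zeroed first. */
void matmul_kij(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != 0 && b != 0 && c != 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* Return 1 if both loop orders produce the same matrix on a small integer input. */
int orders_agree(void)
{
    static double a[N][N], b[N][N], c1[N][N], c2[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;    /* small integers keep the arithmetic exact */
            b[i][j] = i - j;
        }
    matmul_ijk(a, b, c1);
    matmul_kij(a, b, c2);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j])
                return 0;
    return 1;
}
```

To observe the differences predicted by the misses/iter figures, N would have to be large enough that a row no longer fits in cache, and each variant would have to be timed separately.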

## Core i7 Matrix Multiply Performance





Attacks: Side-Channel

# Sharing is a Vulnerability

attackers exploit sharing.

if A & B share E, A can **observe/affect** B through E.

different attacks depending on what is shared

- hardware
- network
- physical world (air-gap)





"somebody toucha my spaghet!"

Attacks: Side-Channel - Hardware

# Imitating the Ideal

ideal computer: infinite cores, infinite memory.

#### fake it

- OS multitasking time-sharing
- memory hierarchy space-sharing

important process requirement: *isolation*. processes <u>share</u> resources.



isolation can be violated! (Unintended communication/interference)

Attacks: Side-Channel - Hardware

# How It Works

[Figure: dialogue between Sherlock and the World]

- Can I use that resource?
- Why not?
- Bob's process is using it.
- Why is Bob's process using it?
- Because Bob's process took the then-branch (not else) in procedure-
- Aha! From this, I conclude...!

Attacks: Side-Channel - Hardware - CPU

## **CPU Timing Attacks**

table of processes, task manager

**Attack:** Process A monitors the CPU load of Process B.

- High CPU load ⇒ 1
- Low CPU load ⇒ 0

**Attack:** Race conditions.

who writes to storage first

[Screenshot: `top` output, a table of running processes with per-process %CPU and %MEM (e.g. insync, firefox, Web Content, gimp-2.9).]
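The CPU-load channel described above (high load ⇒ 1, low load ⇒ 0) amounts to thresholding a series of load samples. The sketch below shows only that decoding step on given numbers. It does not measure any real process, and the function name, the sample values, and the 50% threshold are assumptions.

```c
#include <assert.h>

/* Decode one bit per load sample: above the threshold means 1, otherwise 0. */
void decode_load_bits(const double *load_percent, int n, double threshold, int *bits)
{
    assert(load_percent != 0 && bits != 0 && n >= 0);
    for (int i = 0; i < n; i++)
        bits[i] = load_percent[i] > threshold ? 1 : 0;
}
```

Given samples such as 95.0, 3.0, 88.0, 1.5 and a threshold of 50, the observer reads the bit string 1, 0, 1, 0. A real observer would also have to agree with the sender on how long each bit lasts and cope with noise from other processes.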

#### Attacks: Side-Channel - Hardware - Cache

## Processes Share a Cache



different address space, though. How <u>do</u> they communicate?

Fig. 3. LLC-based covert channel attack scenario.

#### Attacks: Side-Channel - Hardware- Cache

## Cache Timing Attack: Prime+Probe



estimate nr. of cache misses w/ a timer. (remember memory hierarchy)
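The idea behind Prime+Probe can be illustrated on a toy cache model. This is a simulation under heavy assumptions and not a working attack: the cache is direct-mapped with 8 sets, each set only records who owns its line, and a "miss" is read directly from the model, whereas a real attacker would have to infer it from access timing. All names are hypothetical.

```c
#include <assert.h>

#define SETS 8   /* toy cache size, chosen for illustration */

enum owner { EMPTY, ATTACKER, VICTIM };

/* Toy direct-mapped cache: each set remembers only who owns its line. */
struct toy_cache {
    enum owner set[SETS];
};

/* Prime: the attacker fills every set with its own data. */
void prime(struct toy_cache *c)
{
    assert(c != 0);
    for (int s = 0; s < SETS; s++)
        c->set[s] = ATTACKER;
}

/* The victim touches one set chosen by its secret, evicting the attacker's line. */
void victim_access(struct toy_cache *c, unsigned secret)
{
    assert(c != 0);
    c->set[secret % SETS] = VICTIM;
}

/* Probe: the attacker re-reads its data. A set that no longer holds the
 * attacker's line would be a slow access (a miss). Returns the first such
 * set, or -1 if every access hits, and refills the cache along the way. */
int probe(struct toy_cache *c)
{
    int evicted = -1;

    assert(c != 0);
    for (int s = 0; s < SETS; s++) {
        if (c->set[s] != ATTACKER) {
            if (evicted < 0)
                evicted = s;
            c->set[s] = ATTACKER;   /* the probe access reloads the line */
        }
    }
    return evicted;
}
```

After `prime`, a victim access with secret 5 makes `probe` report set 5, so the attacker learns which set the victim used without sharing any memory with it. A second `probe` with no victim activity in between reports -1.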

#### shameless self-promotion

goal: tools that developers can use to write secure SW.

sample research (past supervisions):

- analyze binaries for information leaks
- reduce timing leaks in the Linux kernel
- automatically fix vulnerabilities in JavaScript
- automatically generate (i.e. synthesize) a secure program from a formal specification
- assess privacy risk in analytics programs (data scientists; Google search for "Privugger")

I like code, and I like proofs.

I created the "Applied Information Security" course. I'm a barista in Analog.



# Take-Aways

- Locality in space / time is crucial for performance and scalability.
- Analyzing locality requires an understanding of (i) the memory hierarchy / cache memories, (ii) the layout of data structures in memory, and (iii) how loops lead to reuse of data in space and time.
- Performance & Security are fundamentally at odds (sharing vs isolation)