# A Detailed GPU Cache Model Based on Reuse Distance Theory

Cedric Nugteren, <u>Gert-Jan van den Braak</u>, Henk Corporaal **Eindhoven University of Technology (Netherlands)** 

Henri Bal Vrije Universiteit Amsterdam (Netherlands)

TU/e Technische Universiteit Eindhoven University of Technology













## **Background: reuse distance theory**

### Example of reuse distance theory:

- For sequential processors
- At address or at cache-line (e.g. 4 items) granularity

|                          |                                         |                                                |                        |                             | _                 |                           |                                              |
|--------------------------|-----------------------------------------|------------------------------------------------|------------------------|-----------------------------|-------------------|---------------------------|----------------------------------------------|
| access                   | x[0]                                    | x[5]                                           | x[3]                   | x[9]                        | x[3]              | x[3]                      | x[5]                                         |
| address                  | 0                                       | 5                                              | 3                      | 9                           | 3                 | 3                         | 5                                            |
| distance                 | $\infty$                                | $\infty$                                       | $\infty$               | $\infty$                    | 1                 | 0                         | 2                                            |
| cache-line               | 0                                       | 1                                              | 0                      | 2                           | 0                 | 0                         | 1                                            |
| distance                 | $\infty$                                | $\infty$                                       | 1                      | $\infty$                    | 1                 | 0                         | 2                                            |
| distan<br>freque         | ce 0<br>ncy 1<br>(at cache-lir<br>3 com | 1 2<br>2 1<br>be granularity)<br>bulsory misse | ∞<br>3<br>↓<br>s (42%) | frequency (%)<br>0 10 20 30 | 1<br>reuse distan | example ca<br>with 2 cach | ache<br>ne-lines<br>1 capacity<br>miss (14%) |
| HPCA-20   GPU Cache Mode | I   Gert-Jan van d                      | en Braak   Februar                             | y 2014                 |                             | reuse distar      | ice                       |                                              |



| 1. Parallel                                                                     | exe                                                   | ecu                              | tion     | m                        | ode           | el                      |     |      |        |
|---------------------------------------------------------------------------------|-------------------------------------------------------|----------------------------------|----------|--------------------------|---------------|-------------------------|-----|------|--------|
| Sequentialised (<br>1 thread per<br>4 threads, ex<br>Cache-line s<br>Assume rou | GPU e<br>warp, 1<br>ach 2 lo<br>size of 4<br>nd-robir | exect<br>core<br>ads: ₂<br>eleme | ution e  | exam<br>a] and<br>or now | ple:<br>d x[2 | 2*tid-                  | +1] | time |        |
| instruction                                                                     | 0                                                     | 0                                | 0        | 0                        | 1             | 1                       | 1   | 1    |        |
| thread ID                                                                       | 0                                                     | 1                                | 2        | 3                        | 0             | 1                       | 2   | 3    | (int   |
| address                                                                         | 0                                                     | 2                                | 4        | 6                        | 1             | 3                       | 5   | 7    |        |
| cache-line                                                                      | 0                                                     | 0                                | 1        | 1                        | 0             | 0                       | 1   | 1    | divide |
| distance                                                                        | $\infty$                                              | 0                                | $\infty$ | 0                        | 1             | 0                       | 1   | 0    | 9 by 4 |
| HPCA-20   GPU Cache Model   Gert-Jan van c                                      | den Braak   Fe                                        | bruary 201                       | 4        |                          | frequency     | 75%<br>50%<br>25%<br>0% |     |      |        |



## 1. Parallel execution model

### Sequentialised GPU execution example:

- 1 thread per warp, 1 core
- 4 threads, each 2 loads: x[2\*tid] and x[2\*tid+1]
- Cache-line size of 4 elements





| 2. Memo                        | ry l      | ate          | nci      | es        |     |                      |                      |    |                            |             |    |
|--------------------------------|-----------|--------------|----------|-----------|-----|----------------------|----------------------|----|----------------------------|-------------|----|
| • 4 threads                    | s, each   | 2 loads      | 3: x[2   | *tid]     | and | x[2*                 | tid+                 | 1] |                            |             |    |
| <ul> <li>Cache-lin</li> </ul>  | ne size   | of 4 ele     | ements   | ;         |     |                      |                      |    |                            |             |    |
| <ul> <li>Fixed late</li> </ul> | ency o    | f 2 `time    | -stam    | ps'       |     |                      |                      |    |                            |             |    |
| time                           | 0         | 1            | 2        | 3         | 4   | 5                    | 6                    | 7  | 8                          | 9           |    |
| instruction                    | 0         | 0            | 0        | 0         | 1   | 1                    | 1                    | 1  | ١                          | _<br>boforo |    |
| thread ID                      | 0         | 1            | 2        | 3         | 0   | 1                    | 2                    | 3  | <b>] ] -</b> <sup>as</sup> | -           |    |
| address                        | 0         | 2            | 4        | 6         | 1   | 3                    | 5                    | 7  | -                          | -           |    |
| cache-line                     | 0         | 0            | 1        | 1         | 0   | 0                    | 1                    | 1  | -                          | -           |    |
| cache effect                   | -         | -            | 0        | 0         | 1   | 1                    | 0                    | 0  | 1                          | 1           |    |
| distance                       | $\infty$  | $\infty$     | $\infty$ | $\infty$  | 0   | 1                    | 0                    | 1  | -                          | -           |    |
| latency                        | 2         | 2            | 2        | 2         | 2   | 2                    | 2                    | 2  | -                          | -           |    |
| effect at                      | 2         | 3            | 4        | 5         | 6   | 7                    | 8                    | 9  | -                          | -           |    |
| Note: Extra 'compulso          | ry' misse | es are calle |          | cy misses |     | 7:<br>22<br>22<br>22 | 5%<br>0%<br>5%<br>0% | 1  | 2                          |             | 14 |



| 2. Memor                                | y la        | ten         | cies     | 5             |      |              |                |     |    |
|-----------------------------------------|-------------|-------------|----------|---------------|------|--------------|----------------|-----|----|
| • 4 threads,                            | each 2      | loads:      | x[2*t:   | id] <b>an</b> | d x[ | 2*tid+1]     |                |     |    |
|                                         | SIZE OF     | 4 eiem      | ents     |               |      |              |                |     |    |
| • Variable la                           | tency of    | f 2 (mis    | ses) an  | d 0 (hits     | S)   | _            |                | _   |    |
| time                                    | 0           | 1           | 2        | 3             | 4    | 5            | 6              | 7   | -  |
| instruction                             | 0           | 0           | 0        | 0             | 1    | 1            | 1              | 1   |    |
| thread ID                               | 0           | 1           | 2        | 3             | 0    | 1            | 2              | 3   |    |
| address                                 | 0           | 2           | 4        | 6             | 1    | 3            | 5              | 7   |    |
| cache-line                              | 0           | 0           | 1        | 1             | 0    | 0            | 1              | 1   |    |
| cache effect                            | -           | -           | 0        | 0             | 1    | 0 1 0        | 1              | 1   |    |
| distance                                | $\infty$    | $\infty$    | $\infty$ | $\infty$      | 0    | 0            | 1              | 0   |    |
| hit/miss                                | m           | m           | m        | m             | h    | h            | h              | h   |    |
| latency                                 | 2           | 2           | 2        | 2             | 0    | 0            | 0              | 0   |    |
| effect at                               | 2           | 3           | 4        | 5             | 4    | 75%<br>\$50% |                |     |    |
|                                         |             |             |          |               |      | and 25%      |                |     |    |
| HPCA-20   GPU Cache Model   Gert-Jan va | n den Braak | February 20 | 14       |               |      | ₩ 0 1<br>re  | 2<br>use dista | nce | 16 |



| . MSHRs                                                                      |                                |                                    |         |          |          |      |       |
|------------------------------------------------------------------------------|--------------------------------|------------------------------------|---------|----------|----------|------|-------|
| <ul> <li>2 out of the 4</li> <li>Cache-line s</li> <li>Only 1 MSH</li> </ul> | 4 thread<br>size of 4<br>R pos | s, each 2 I<br>elements<br>stponed | oads: > | [2*tic   | a] and   | x[2* | tid+1 |
| time                                                                         | 0                              | 1                                  | 2       | 3        | 4        | 5    | 6     |
| instruction                                                                  | 0                              | 0                                  | 1       | 0        | 1        | -    | -     |
| thread ID                                                                    | 0                              | 2                                  | 0       | 2        | 2        | -    | -     |
| address                                                                      | 0                              | 4                                  | 1       | 4        | 5        | -    | -     |
| cache-line                                                                   | 0                              | 1                                  | 0       | 1        | 1        | -    | -     |
| cache effect                                                                 | -                              | -                                  | 0 0     | -        | -        | 1    | 1     |
| distance                                                                     | $\infty$                       | $\infty$                           | 0       | $\infty$ | $\infty$ | -    | -     |
| MSHRs used                                                                   | 0                              | 1                                  | 0       | 0        | 1        | -    | -     |
| status                                                                       | miss                           | cancel                             | hit     | miss     | miss     | -    | -     |
| MSHRs used                                                                   | 1                              | -                                  | 0       | 1        | 1        | -    | -     |
| latency                                                                      | 2                              | -                                  | 0       | 2        | 2        | -    | -     |
| effect at                                                                    | 2                              | -                                  | 2       | 5        | 6        | -    | -     |























