Causality
Latest revision as of 01:05, 21 July 2020
In Sniper (as in Graphite, https://github.com/mit-carbon/Graphite, from which Sniper is derived), each memory access is simulated to completion in a single function call. This means that in the memory subsystem, time advances during the simulation of one memory access; time is then potentially set backwards to start the simulation of the next memory access. As a result, time inside the memory subsystem (including the on-chip network models) does not always advance monotonically (known affectionately as the "fluffy time" problem).
Additionally, different core models execute relatively independently. At the same instant in real time, each core can be making a memory access at a slightly different simulated timestamp while accessing the same resource (e.g. a shared cache or an on-chip network link). Here again, the shared resource will see operations from different cores arrive with timestamps that are potentially out of order.
This differs substantially from many other simulators, which process events in order, usually on a cycle-by-cycle basis. That simulation method ensures that causality errors do not occur, but it is much slower because it requires a lot of synchronization between the simulator threads handling the different models. In contrast, Sniper allows each simulated core to make independent progress for a considerable length of time, but periodically inserts global barriers (by default, every 1000 simulated nanoseconds) to ensure an upper bound on the timing differences (skew) between the different cores. This design choice greatly improves parallelism inside the simulator and allows for much higher simulation speeds.
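The effect of periodic barriers on skew can be illustrated with a small toy model. This is only a sketch and not Sniper's code: the function name `max_skew_with_barriers`, the fixed per-core step sizes, and the round-robin loop (standing in for independently running simulator threads) are all assumptions made for illustration.

```python
def max_skew_with_barriers(step_sizes_ns, barrier_interval_ns, total_ns):
    """Toy model: return the largest clock difference seen between any two cores.

    Each core advances its simulated clock by its own (positive) step size,
    but no core may pass the next global barrier until all cores reach it.
    """
    clocks = [0] * len(step_sizes_ns)
    barrier = barrier_interval_ns
    max_skew = 0
    while barrier <= total_ns:
        # Cores progress independently until every one of them hits the barrier.
        while min(clocks) < barrier:
            for core, step in enumerate(step_sizes_ns):
                clocks[core] = min(clocks[core] + step, barrier)
            max_skew = max(max_skew, max(clocks) - min(clocks))
        barrier += barrier_interval_ns
    return max_skew


# A slow core (10 ns per step) and a fast core (300 ns per step).
bounded = max_skew_with_barriers([10, 300], barrier_interval_ns=1000, total_ns=5000)
unbounded = max_skew_with_barriers([10, 300], barrier_interval_ns=5000, total_ns=5000)
print(bounded, unbounded)
```

In this toy setting the skew never exceeds the barrier interval, whereas with a single barrier at the very end the fast core runs thousands of simulated nanoseconds ahead of the slow one.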
Bandwidth modeling and queuing delays
Because of this problem, care must be taken when implementing queuing models on, for instance, network links. The history-list queue model keeps a list of previous times when the resource was in use. Then, when a request arrives with an earlier timestamp, that timestamp can be looked up in the history list, and the model can determine whether the resource was free at that time and, if not, the earliest time at which the resource is free, which determines the queuing delay.
Of course, even the history-list queue model cannot handle actual causality errors, which occur when a request that is earlier in simulated time should affect a later request but is simulated later in wall-clock time (because it was generated by a core that was lagging behind). Still, we found that this system works well enough for memory subsystem trade-off studies, while yielding significantly faster simulation speeds. An exploration of this and other synchronization methods, and their accuracy compared to real hardware, can be found in our SC'11 paper, Figure 12.
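The history-list idea can be sketched in a few lines. This is a simplified illustration rather than Sniper's actual implementation: the class name `HistoryListQueue`, the `request` method, and the rule that a request needs one contiguous free slot of `service_time` are assumptions made for this example.

```python
import bisect


class HistoryListQueue:
    """Toy history-list queue model for a single shared resource."""

    def __init__(self):
        # Sorted, non-overlapping busy intervals, stored as (start, end) tuples.
        self.busy = []

    def request(self, arrival, service_time):
        """Book the earliest free slot at or after `arrival`; return the queuing delay."""
        start = arrival
        # Begin with the last interval that starts at or before the arrival time.
        i = max(bisect.bisect_right(self.busy, (arrival, float("inf"))) - 1, 0)
        while i < len(self.busy):
            busy_start, busy_end = self.busy[i]
            if busy_end <= start:
                i += 1  # this interval ends before the candidate slot begins
                continue
            if busy_start >= start + service_time:
                break  # the gap in front of this interval is large enough
            start = busy_end  # overlap: retry right after this busy interval
            i += 1
        bisect.insort(self.busy, (start, start + service_time))
        return start - arrival


link = HistoryListQueue()
print(link.request(100, 10))  # resource idle, no delay
print(link.request(105, 10))  # overlaps the first request, must wait
print(link.request(50, 10))   # earlier timestamp: the history shows the link was free
```

The third request arrives out of order with an earlier timestamp; because the history shows the resource was idle at that time, it is served without delay instead of being queued behind requests that are later in simulated time.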