Frequently Asked Questions

Q: How is the memory subsystem in Sniper different from other simulators?

A: The memory hierarchy in Sniper resolves memory requests immediately. This allows Sniper to run at higher speeds, as it does not require the simulator to run in lockstep every cycle. By default, we use barrier synchronization with a fairly low interval (1000ns) to maintain accuracy, as relaxed synchronization (--clock_skew_minimization/scheme=none) can result in large errors. See our SC11 paper for more details on the synchronization options and the speed versus accuracy trade-offs. For more details on how Sniper's memory subsystem works, see our causality information page.
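The synchronization scheme and barrier interval can be changed on the run-sniper command line. A minimal sketch (the barrier quantum option name below is our recollection and should be checked against the configuration files shipped with your version; the fft invocation is just the example used elsewhere on this page):

# select barrier synchronization with a 1000 ns quantum (option names assumed; verify against your config files)
./run-sniper -n 4 -c gainestown --roi -g --clock_skew_minimization/scheme=barrier \
-g --clock_skew_minimization/barrier/quantum=1000 -- ./fft -p 4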

Q: How good is the accuracy of Sniper compared to real hardware?

A: In our SC11 paper we describe the validation of Sniper against the Core 2 (Dunnington) microarchitecture. Between the publication of that paper and the first open-source release of Sniper, however, the simulator changed considerably. Some changes were required for the release itself; for example, our original micro-op decoder was not MIT-licensed and had to be rewritten. Other changes affected the modeled microarchitecture: the first release did not include Core 2 configuration files and settings, but was instead the beginning of a migration to the Nehalem microarchitecture. While many microarchitectural parameters can be set through config-file options, a number of them were still hard-coded in the source, so the released version contained hard-coded additions to support Nehalem (we are working towards making most of these options configurable through the config files, but we aren't quite there yet).

We are currently validating the Nehalem models, which is, of course, a very time-consuming endeavor. From a core perspective, we are quite happy with the results we obtain for microbenchmarks that use the most common floating-point and integer instructions, and we are still working to increase the accuracy of all of the components of the microarchitecture. We will update this page with specific validation results in the coming months. We are trying our best to make an accurate simulator, and your feedback is appreciated to improve its accuracy: please tell us if you find any issues (either incorrect models or bugs) so we can fix them right away. No simulator is perfect, but our goal is to get as close as possible. The feedback so far has been great, thanks.

Q: How can I configure core frequencies/DVFS in Sniper?

A: There are a number of ways to configure core frequencies in Sniper. The easiest way to set the default startup frequency for all cores is via the run-sniper parameter -g --perf_model/core/frequency=1.0. If you would like to control per-core frequencies, first define a set of frequency domains (one per core) with -g --dvfs/type=simple -g --dvfs/simple/cores_per_socket=1. Then, in the application itself, you can call either SimSetFreqMHz(core_id, int_freq_in_mhz) or SimSetOwnFreqMHz(int_freq_in_mhz) (these SimAPI calls are defined in sniper/include/sim_api.h; a short sketch follows the example below). If modifying the application is not possible, or you would like to define the per-core frequencies at runtime, you can use the dvfs.py script. An example might look like the following:

user@host:~/sniper/test/fft$ ../../run-sniper -n 4 -c gainestown --roi -sdvfs:0:0:1000:0:1:2000:0:2:3000:0:3:4000 \
-g --dvfs/type=simple -g --dvfs/simple/cores_per_socket=1 -- ./fft -p 4

The scripting interface is invoked with -s<script>:<options>, and the dvfs.py script accepts its options as a list of 3-tuples of the form (ns_time_to_change_freq, core_id, int_freq_in_mhz). The example above sets the first four cores to 1, 2, 3 and 4 GHz at the start of the ROI.
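If you prefer the in-application approach, a minimal C sketch using the SimAPI calls mentioned above (SimSetOwnFreqMHz and SimSetFreqMHz from sniper/include/sim_api.h; the ROI markers and the four-core assumption are only illustrative) could look like this:

#include "sim_api.h"   /* SimRoiStart/SimRoiEnd, SimSetFreqMHz, SimSetOwnFreqMHz; compile with -I<sniper>/include */

int main(void)
{
    SimRoiStart();

    /* Run the calling core at 2000 MHz (frequencies are integers in MHz). */
    SimSetOwnFreqMHz(2000);

    /* ... first phase of the computation ... */

    /* Lower core 3 to 1000 MHz from this point on (assumes at least 4 cores). */
    SimSetFreqMHz(3, 1000);

    /* ... remainder of the region of interest ... */

    SimRoiEnd();
    return 0;
}

The magic instructions emitted by these macros are interpreted by Sniper and should behave as no-ops when the program is run natively.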

As of Sniper 3.0, it is also possible to set core frequencies using the heterogeneous configuration support. For example:

./run-sniper -n 4 -c gainestown --roi -g --dvfs/type=simple -g --dvfs/simple/cores_per_socket=1 -g --perf_model/core/frequency=1,2,3,4

This sets cores 0 through 3 to 1, 2, 3 and 4 GHz, respectively. Note that these frequencies are specified as floating-point values in GHz, while the SimAPI calls and DVFS scripts use integers in MHz. Additionally, a bug in the DVFS handling of Sniper 3.0 requires cores_per_socket to be set to 1 when using heterogeneous configuration; this will be fixed in the next release.


Q: Why do the cache statistics in Sniper for a particular application not agree with my HW performance counters?

A: There are two major reasons why the cache numbers can differ from hardware. The first is how the cache counters collect data in Sniper. The cache access rates should be comparable to real hardware, but the miss rates can in some cases be rather different, because misses that overlap with an already outstanding miss are counted as hits in Sniper, whereas real hardware would count them as misses. Internally, Sniper's memory subsystem completes each access and obtains its result immediately, and uses a queuing model to determine contention. (The timing in Sniper is still modeled correctly, since the second memory access depends on the first one and therefore cannot complete before it.) One way to compare the number of L1-D cache misses reported by Sniper with hardware is to compare it against the number of L2 data accesses made by the same core: the L2 accesses correspond to the non-overlapped L1-D misses, which is how the statistics in Sniper are counted. The second reason for differences is that there can be hardware structures or hardware limitations that are either unknown to us, or that we do not model completely.

Q: Why does the CPI-stack format that I generate with Sniper differ from the SC11 paper results?

A: We have recently updated the CPI stack format to better reflect system resource contention. See our recent IISWC publication for more details on these changes.

Q: Why does the TLB code in Sniper not perform the way that I expect?

A: Sniper is a user-space simulator, and therefore does not model all of the hardware/operating-system interactions that one might expect to see. This is because the applications we target, HPC workloads, tend to see very few TLB misses. As an experiment, we looked into modeling the OS effects of TLB misses, but only from the perspective of OS-handled TLB misses. To use this, set the TLB size to that of the last-level TLB of the architecture you are modeling, and set the miss penalty to hundreds of cycles to account for the OS handling cost. Modeling L1 TLBs is possible but not currently implemented; it would require modifications to the memory subsystem so that TLB misses are reported as part of the load and store access times.
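A command-line sketch of this experiment (the TLB option names and values below are assumptions based on how we remember the configuration being structured and may not match your Sniper version; check the shipped .cfg files before relying on them):

# pretend the last-level TLB is the only TLB, with an OS-sized miss penalty (option names and values illustrative only)
./run-sniper -n 4 -c gainestown --roi -g --perf_model/dtlb/size=512 -g --perf_model/dtlb/associativity=4 \
-g --perf_model/tlb/penalty=200 -- ./fft -p 4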

Q: How can I have multiple regions of interest?

A: Sniper expects there to be only one set of ROI begin/end markers. There are, however, ways of marking different code regions and getting separate statistics for each of them; see Multiple regions of interest. A sketch of one possible approach is shown below.
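One possible sketch, assuming the SimMarker(arg0, arg1) macro from sniper/include/sim_api.h (the marker values and region layout are purely illustrative; the Multiple regions of interest page describes the supported workflow and how markers are turned into per-region statistics):

#include "sim_api.h"   /* SimRoiStart/SimRoiEnd and SimMarker; compile with -I<sniper>/include */

int main(void)
{
    SimRoiStart();

    SimMarker(1, 0);              /* begin region 1 (marker values are chosen by you) */
    /* ... first code region ... */
    SimMarker(1, 1);              /* end region 1 */

    SimMarker(2, 0);              /* begin region 2 */
    /* ... second code region ... */
    SimMarker(2, 1);              /* end region 2 */

    SimRoiEnd();
    return 0;
}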

Q: What are the license terms for using Sniper?

A: In short, the interval core model is protected under a US patent application. We automatically grant you a free license for using the interval model inside Sniper for academic purposes. For commercial use, please contact Lieven Eeckhout. All other code is licensed under the very liberal MIT license. You can view the full details on our License page.