State of the Art of Performance Visualization

Katherine E. Isaacs\textsuperscript{1}, Alfredo Giménez\textsuperscript{1}, Ilir Jusufi\textsuperscript{1}, Todd Gamblin\textsuperscript{2}, Abhinav Bhatele\textsuperscript{2}, Martin Schulz\textsuperscript{2}, Bernd Hamann\textsuperscript{1}, and Peer-Timo Bremer\textsuperscript{2}

\textsuperscript{1}Department of Computer Science, University of California, Davis
\textsuperscript{2}Lawrence Livermore National Laboratory

Abstract
Performance visualization comprises techniques that aid developers and analysts in improving the time and energy efficiency of their software. In this work, we discuss performance as it relates to visualization and survey existing approaches in performance visualization. We present an overview of what types of performance data can be collected and a categorization of the types of goals that performance visualization techniques can address. We develop a taxonomy for the contexts in which different performance visualizations reside and describe the state of the art research pertaining to each. Finally, we discuss unaddressed and future challenges in performance visualization.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Picture/Image Generation—Line and curve generation

1. Introduction
High performance computing (HPC) simulations drive innovation across a wide range of scientific fields, including astrophysics, climate simulation, material science, combustion, and energy production. Numerical problems in these disciplines would take hundreds of years to compute without massively parallel machines. To shape the development of future fast, power-efficient architectures, and to accelerate the pace of computational science, it is critical to gain a comprehensive understanding of the factors that affect performance and power consumption on HPC systems.

Optimizing the performance of parallel applications is not straightforward, and performance analysis has become increasingly complex. Programs now must take advantage of multicore processors, programmable Graphics Processing Units (GPUs), and multi-level non-uniform memory hierarchies. On-node performance counters and instrumentation tools allow detailed performance measurements, but the profusion of data they generate when applied to parallel programs makes exploring and understanding the data difficult.

This highlights the need for performance visualization techniques. We present an overview of performance visualization and survey existing work. Our contributions are:

\begin{itemize}
  \item A classification of the goals of developers and analysts who use performance visualization.
  \item A context-based classification and survey of existing performance visualizations.
  \item A discussion of challenge areas in developing new and more powerful performance visualizations.
\end{itemize}

Others [MMC02, KS93] have reviewed software visualization techniques, but tended to lump visualizations focusing on performance into a single category. We focus solely on performance, avoiding other software visualization areas such as software evolution, programming environments, visual programming, and software design. Performance tends to overlap with debugging and general program comprehension, so we include work from those areas as appropriate.

2. Performance Data
We detail methods for acquiring performance data and the types of performance data that can be generated. Many tools can be used to record performance measurements [GKM04, NS07, Rei05, MBDH99, SM06, BM11], allowing visualization developers to gather their own datasets.

2.1. Methods for Acquiring Performance Data

2.1.1. Instrumentation

Instrumentation is the act of modifying a program for an alternative purpose: in this case, for acquiring performance data. At the most basic level, an instrumentation tool inserts extra code into a program’s control flow. The instrumentation code may record timer values, or it may perform more complex analysis, such as writing out program variable values and recording variable accesses and conditions met. Instrumentation can be applied to source code before compiling, or it can be applied at runtime using binary modification [BM11] or sampling [ABF∗10]. Care must be taken to ensure that the instrumentation does not change normal program behavior or add excessive overhead.
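As an illustration of source-level instrumentation, the following minimal Python sketch wraps a function so that each call records a timer value. The names `instrument`, `timings`, and `compute` are hypothetical and chosen only for this example; real instrumentation tools operate on compiled code or binaries and take far more care to limit overhead.

```python
import functools
import time

# Hypothetical in-memory store: function name -> list of elapsed seconds.
timings = {}

def instrument(func):
    """Wrap func so that every call records its wall-clock duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            # Run the original code unchanged.
            return func(*args, **kwargs)
        finally:
            # The inserted measurement code runs even if func raises.
            elapsed = time.perf_counter() - start
            timings.setdefault(func.__name__, []).append(elapsed)
    return wrapper

@instrument
def compute(n):
    """Stand-in for application work."""
    return sum(i * i for i in range(n))

result = compute(1000)
```

The wrapper leaves the return value of `compute` untouched, which mirrors the requirement that instrumentation must not change normal program behavior.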

2.1.2. Interception

Interception is a form of instrumentation that leverages function calls already present in program source code. Interceptor functions are typically grouped together into a library, which is then linked with a program, either dynamically or statically. The program’s original function calls are linked to the interception library, which executes special measurement code, then delegates to the original implementation of the intercepted function. Interception is useful for profiling libraries because it can record the dynamic values of parameters passed to library calls. This can give more semantic context to measurements. For example, communication performance of many parallel programs is often measured by intercepting calls to the Message Passing Interface (MPI), and interceptor calls can differentiate between send and receive operations based on the size of data passed to them. As with other types of instrumentation, interception must be used sparingly to avoid incurring overhead.
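The interception pattern described above can be sketched in a few lines of Python. Here `send` is a hypothetical stand-in for a library call (it is not an MPI binding), and `intercept` and `intercepted` are illustrative names. The interceptor records the dynamic parameter values, then delegates to the original implementation.

```python
def send(dest, payload):
    """Hypothetical library call: pretend to send payload to rank dest."""
    return len(payload)

# Measurement log: (function name, destination, message size in bytes).
intercepted = []

def intercept(original):
    """Return an interceptor that measures, then delegates to original."""
    def interceptor(dest, payload):
        # Measurement code: record the parameters passed to the call.
        intercepted.append((original.__name__, dest, len(payload)))
        # Delegate to the original implementation.
        return original(dest, payload)
    return interceptor

# Rebinding the name plays the role of linking against the interception library.
send = intercept(send)

sent = send(3, b"abcd")
```

Because the application still calls `send`, no application source changes are needed, which is the main appeal of interception for profiling libraries.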

2.1.3. Profiling and Tracing

Profiling and tracing are measurement techniques that determine where a single execution of a program spends its time. Profiling tools, such as gprof [GKM04] and VTune [Rei05], pause execution of a program repeatedly over a specified sampling period and record the contents of either the instruction pointer or the entire call stack. At the end of a profiling run, samples are analyzed to determine the percentage of time spent in each part of the code. Profiles lose temporal information but quickly identify key bottlenecks in a program. Tracing is similar to profiling in that it measures a program’s execution, but it records a detailed timeline of when events occurred. For example, a trace might record function entry and exit times for an entire run. Because tracing does not aggregate over time, recorded traces can require large amounts of memory, which can cause excessive overhead compared to profiling. TAU [SM06], Vampir [NAW∗96], and EPILOG [WMI04] provide resources for creating and dumping traces of programs that use MPI.

Profiling trades off comprehensive data for low overhead, while tracing provides complete data on runtime events at the cost of much higher overhead.
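To make the aggregation step of sampling-based profiling concrete, the sketch below turns a list of samples into percentages of time per function. The sample data and the function name `profile` are assumptions for illustration; a real profiler would obtain samples by interrupting the program and reading the instruction pointer or call stack.

```python
from collections import Counter

# Hypothetical samples: the function found executing at each sampling tick.
samples = ["solve", "solve", "exchange", "solve",
           "io", "solve", "exchange", "solve"]

def profile(samples):
    """Aggregate samples into the percentage of ticks spent per function."""
    counts = Counter(samples)
    total = len(samples)
    return {name: 100.0 * n / total for name, n in counts.items()}

percent = profile(samples)
```

The result keeps no ordering information: the same percentages arise from any permutation of `samples`, which is the temporal information a profile gives up and a trace retains.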

2.2. System Monitoring

The measurements discussed so far are application-level measurements, in that they measure the performance of a single application process. We can also acquire system-wide performance information during a program’s execution by executing external processes, enabling system-wide counters (described in Section 2.3.1), or gathering other metadata at runtime. Such methods require no modification to the code or executables involved, and as such are simple to use. However, this kind of data is generally very coarse and semantically low-level. Further, because data is collected outside the measured application process, it can be difficult to attribute measurements to the target program’s source code.

2.3. Types of Performance Data

2.3.1. Counters

A counter is a special hardware register that accumulates the number of occurrences of a specified event over time. Countable events can be either software events, such as system calls, or hardware events, such as floating-point operations, cache misses, and packets received over a network link. The complete set of countable events is specific to the platform being used but is generally quite extensive. Commonly, counters are either instrumented to initialize and terminate around a block of code designated for analysis or run system-wide during program execution. PAPI [MBDH99] provides a portable interface to specify, initialize, terminate, and read out counters.

The overhead and precision of performance counter measurements depend on how frequently they are sampled. Sampling performance counters too frequently gives high overhead, which can limit precision and make attributing counted events to particular instructions difficult.

Counter data most directly benefits visualizations in the hardware (Section 5) and software (Section 6) contexts. For example, in a network visualization, packet counters can be recorded per-link to visualize network traffic. Other hardware counters can be mapped directly to the resource (CPU, memory, etc.) responsible for generating the event. Counters measured within instrumentation can also be attributed indirectly to the instrumented code, giving software context.

2.3.2. Hardware Samples

Traditional accumulative hardware counters have been extended to provide more precise and detailed information about particular instructions. Instead of simply incrementing a counter, modern hardware performance units can write detailed information about an instruction’s execution, including its precise instruction pointer, progress within the processor’s pipeline, total latency, and more. Intel and AMD processor architectures both include hardware capabilities to measure memory loads and stores, and Intel in addition provides a capability to sample branching events [Int07, DC07].

Hardware sampling provides finer granularity with low overhead because it is implemented as part of a microprocessor. Tools still need to conduct detailed analysis to attribute such samples to program source code.

2.3.3. Traces and Call Paths

Trace files contain lists of timestamped point events recorded during program execution. These events can include procedure entry and exit, message sends and receives, and object acquisitions and releases. By following function entries and exits, the call stack at any point in time can be derived. These events may also be associated with certain hardware elements like memory addresses or particular CPUs.

In some parallel environments, there may be one trace file generated for each process or thread, so trace data size typically scales with the number of concurrent tasks. Parallel systems may not guarantee high resolution clock synchronization, resulting in some inaccuracy in event timestamps.

Depending on the features in the tracing tool and the options selected by the user, more or less information can be included – for example, message sizes with the sends and receives or parameters with the procedures. Some tools can also record counter values with each event.
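The derivation of a call stack from function entry and exit events can be sketched as follows. The event format `(timestamp, kind, name)` and the function name `call_stack_at` are assumptions made for this example and do not correspond to any particular trace format.

```python
def call_stack_at(events, t):
    """Replay entry/exit events up to time t and return the call stack.

    events: list of (timestamp, kind, name) sorted by timestamp,
            where kind is "enter" or "exit".
    """
    stack = []
    for timestamp, kind, name in events:
        if timestamp > t:
            break
        if kind == "enter":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            raise ValueError(f"unmatched exit for {name!r} at {timestamp}")
    return stack

# Hypothetical single-process trace.
events = [
    (0, "enter", "main"),
    (1, "enter", "solve"),
    (4, "exit", "solve"),
    (5, "enter", "write"),
    (7, "exit", "write"),
    (8, "exit", "main"),
]

stack_during_solve = call_stack_at(events, 2)
stack_during_write = call_stack_at(events, 6)
```

In a parallel trace this replay would be done per process or thread, and clock skew between processes would have to be considered when comparing timestamps across them.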

3. Performance Goals

The main goal of performance analysis, and thus of performance visualization, is to make the application execute faster or use less power. There are several sub-goals on the road to efficiency that have utilized visualization. In this section we discuss these goals, dividing them into three main categories: global comprehension, problem detection, and diagnosis and attribution.

3.1. Global Comprehension

Often the first step in optimizing an application is understanding the big picture regarding what occurred during an execution. When specific targets for optimization are unknown, analysts must narrow down regions of interest from the whole application. Global comprehension goals also exist so users can get a sense of normal behavior as well as compare predicted and achieved performance. Visualizations that present a strong overview or allow for pattern matching may be particularly useful here.

The tasks involve understanding program structures and resource utilization. Program structures include phases of execution, algorithms, data structures, communication patterns, data motion, access patterns, and data dependences. Resource utilization includes the magnitude and distribution of demands on processors, memory, and the network. Understanding the intricate relationships between these different aspects of program behavior forms the necessary foundation for identifying and understanding the performance of an execution.

3.2. Problem Detection

Visualization can help developers detect performance problems such as anomalous behavior, performance bottlenecks, load imbalance, and resource usage issues. Outlier detection, pattern detection, focus+context features, and dependency tracking can aid in finding problems.

Anomalous behavior includes deadlocks, livelocks, data race issues, or unexpected behavior. Bottlenecks and imbalance may exhibit similar symptoms like outlier computation or message durations and significant idle times. These problems could also be detected by recognizing network congestion or memory contention or by characteristics of the critical path through the execution.

Resource misuse includes low parallelism and false parallelism, where many threads are created unnecessarily. Synchronization may also be unnecessary and impede performance. Another resource issue could be poor locality in data accesses.

3.3. Diagnosis and Attribution

Diagnosis of a problem may follow directly from detection or be more subtle. Problems may be attributed to software through relationships with lines of code, variables, data structures, or third party libraries. This attribution may be a step toward recognizing a poor distribution or division of work, a sub-optimal algorithm or data structure, or an opportunity for better overlap of messaging and computation. In distributed and parallel systems, the mapping of tasks to the system can likewise be an issue.

Problems may also be attributed to the system on which the code is run. Operating system effects, memory or scheduling policies, and network routing algorithms may contribute to poor performance. Finding these effects gives developers the information necessary to take any steps they can to ameliorate them.

Highlighting the true sources of inefficiency can be difficult. Linking and correlation, pattern detection, dependency tracking, and ensemble comparison features may aid in achieving these goals.

4. Taxonomy

We organize our survey by the main context represented by the visualization. By context, we refer to the concepts onto which the data is mapped and of which the visualization is constructed. In some cases, this context can be derived directly from recorded data, such as a visualization focusing on a specific data column. In other cases, the context may require some form of additional input about the environment from which the data was collected, such as structure mined from the source code or the graph of a distributed system. Sometimes this information is assumed and hard-coded into the visualization. We define four major contexts in performance visualization: hardware, software, tasks, and application.

Hardware is the natural context for data collected from performance counters, as these are associated with individual hardware elements like nodes, cores, or links. Additional context in hardware includes the hierarchical grouping of the elements, the network topology of these elements, and queues, scheduling and interfaces associated with these elements (even if they may be implemented by low-level software).

Software covers contexts related to a program’s source code. This includes static information such as the class structure of the program and individual variables as well as dynamic data associated with executions, such as call graphs, developed data structures, and program flow.

Tasks contexts involve the individual tasks performing the computation. Tasks contexts exist at many levels of granularity: Processing elements, threads, and processes are fractions of a single program. Jobs and commands represent entire programs that may share a system. Note that processing elements form a tasks context when viewed as mostly anonymous actors performing work, but they form a software context when considered as specific (and largely different) objects interacting in an object-oriented program.

Application refers to the context of what is actually being computed. In scientific simulations, this is often bounded physical space. Another common application context is the set of matrices used in matrix libraries.

Some visualizations draw contexts from multiple categories. For example, a tasks layout may be influenced by the underlying topology of the processors on which they run. In these situations, we classify the visualization by the dominant context, but make mention of the additional contexts used. In the specific case of contexts related to the operating system (OS), we generally classify them under hardware with the justification that the OS is typically not programmable to the extent that the software is.

Figure 1 provides an alternative picture, organizing the most recent visualizations by complete tool rather than by individual view as we discuss in the following sections. This shows which tools cover multiple contexts, as well as which goals they address and what sizes of problems they handle.

5. Hardware Visualization

Visualizations in the hardware context create a visual representation of the hardware on which an application code is run. Often these visualizations map data from performance counters onto a depiction of the hardware from which the data originated. Building representations of different hardware requires developing an intuitive metaphor for the topology of the hardware. An effective hardware metaphor decomposes the hardware into its basic elements while retaining its unique characteristics necessary for performance analysis. These techniques aim to illustrate complex hardware topology, identify hardware-based performance problems, and show the relationship between software and hardware.

We categorize visualizations in the hardware context into those depicting the computing network and those depicting individual compute nodes.

5.1. Network

Supercomputing nodes are connected via a network. The system can be interpreted as a graph where vertices are nodes and edges are network links. The performance visualizations therefore often take the form of graph visualizations. Because network topologies vary so widely in their structures, a challenge in creating network visualizations is that each must be highly tailored to a specific topology. Tree-based networks, such as fat-trees, lend themselves to hierarchical visualizations, while others, such as torus and hypercube networks, lend themselves to complex graph layouts and dimension-reducing projections.

Figure 1: Classification of recent visualizations by context, scale, and goal. We limit the scale and goal to what was reported (rounded); in practice the visualization may exceed what is listed in the chart. Any values that were unclear or missing entirely from the publications are marked not reported (NR). We focus on works published in the last 10 years.

A common general representation of the network graph is an adjacency matrix with computation nodes along the x and y axes, sometimes referred to as a communication matrix [HE91]. ParaGraph [HE91] depicts communication matrices and color-codes the elements to indicate areas of heavy link traffic. Zhou and Summers [ZSC03] use an adjacency matrix to show quaternary fat-trees and depict transactions by animating 3-dimensional glyphs on matrix element locations. While communication matrices effectively show all links, they are ineffective at showing the shape of the network and the distance between non-neighboring nodes. Furthermore, because they show all possible links, and existing links are shown twice (once for each direction), they contain much visual redundancy.
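A communication matrix of the kind described above can be built by accumulating message sizes into a square array indexed by source and destination. The sketch below is a minimal illustration; the message tuple format and the name `communication_matrix` are assumptions, and it is not taken from ParaGraph or any other surveyed tool.

```python
def communication_matrix(num_nodes, messages):
    """Accumulate bytes sent between node pairs into an adjacency matrix.

    messages: iterable of (source, destination, size_in_bytes).
    """
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst, size in messages:
        matrix[src][dst] += size
    return matrix

# Hypothetical traffic between four nodes.
messages = [(0, 1, 100), (1, 0, 40), (0, 1, 60), (2, 3, 10)]
matrix = communication_matrix(4, messages)

# Count the cells that carry no traffic at all.
empty_cells = sum(cell == 0 for row in matrix for cell in row)
```

Even in this small example most cells are empty, since the matrix reserves space for every possible node pair and stores each direction of a link separately.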

Haynes et al. [HCR01] create another general representation by depicting all nodes in a 2-dimensional grid and color-coding all network links. As such, the user can follow links by finding matching colors, with no redundant links or wasted space. However, this technique still struggles to show the network shape and paths with multiple links. Another issue is that humans are visually limited in discerning multiple unique colors with accuracy, which limits the size of the network this visualization can usefully represent.

Zhou and Summers [ZSC03] also use a modified 2-dimensional H-tree layout to demonstrate the network topology of the quaternary fat-tree used in a variety of HPC systems. They aggregate histograms and arcs in a third dimension to show messages passed between nodes. Mueller et al. [MSM∗11] demonstrate another hierarchical style graph visualization of the I/O network for Blue Gene/P. They depict the network in a radial layout with storage nodes in the center, compute nodes on the outside, and I/O nodes in between. The performance data is aggregated on the drawn links between the different types of nodes. Both hierarchical visualizations demonstrate the ability to discover areas of heavy communication traffic within their respective network topology. Mueller et al. [MSM∗11] depict the entirety of the performance data in a single 2-dimensional view, while Zhou and Summers [ZSC03] take advantage of encoding data in the third dimension and using animation, at the cost of occlusion and complexity. Purely hierarchical approaches are only possible for specific network topologies, but effectively reduce the visual complexity of the network graph by utilizing well-known hierarchical metaphors. We also note that the aforementioned visualizations deal with network trees of relatively shallow depth, but this has not yet been an issue because existing HPC interconnects typically do not use much deeper hierarchies.

Many visualizations lay out the network graph of specific network topologies as 2- or 3-dimensional meshes, with nodes as vertices and links as edges. Boxfish [ILG∗12] provides an interface for displaying performance data on a mesh representing a 3-dimensional torus network. Landge et al. [LLB∗12] create 2-dimensional projections of the 3-dimensional torus network in Boxfish with no occlusion (Fig. 2). Haynes et al. [HCR01] depict, in addition to their general 2-dimensional visualization, a 3-dimensional mesh layout of a 3-dimensional torus network. The mesh layouts create more intuitive depictions of the network, but there does not always exist an intuitive mesh projection for a network topology type. As topologies increase in dimension, such as the 5-dimensional and 6-dimensional torus, it becomes much more difficult to create a low-dimensional mesh that is easily understood.
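To illustrate why torus networks are harder to draw than plain meshes, the following sketch enumerates the neighbors of a node in a torus of arbitrary dimension. The function name `torus_neighbors` is hypothetical. The modulo operation produces the wraparound links that connect opposite faces of the mesh, and these are the links that a flat layout must either bend, cut, or omit.

```python
def torus_neighbors(coord, dims):
    """Return the neighbors of coord in a torus with the given dimensions.

    Each axis contributes two neighbors (one step back, one step forward),
    with wraparound at the boundaries.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            neighbor = list(coord)
            neighbor[axis] = (neighbor[axis] + step) % size
            neighbors.append(tuple(neighbor))
    return neighbors

# Corner node of a hypothetical 4 x 4 x 4 torus.
corner = torus_neighbors((0, 0, 0), (4, 4, 4))
```

For a 5-dimensional torus the same function returns ten neighbors per node, which gives a sense of how quickly the link structure outgrows a 2- or 3-dimensional drawing. Note that for axes of size 1 or 2 the two steps coincide, so the returned list may contain duplicates.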

5.2. Node

The topologies of CPU nodes are often relatively small compared to network topologies; these are on the order of tens and hundreds of processors and memory resources. For this reason, parallel programs usually employ a mapping from \(N\) tasks to \(M\) processors, with \(N \gg M\). Techniques for visualizing on-node computation space often take the form of task-based visualizations [TBD10, KJL07, dKSB00, Rei05, ABF∗10]. However, such a mapping typically does not exist between tasks and memory resources, especially in the context of multi-level memory hierarchies where multiple processors share resources simultaneously. As a result, on-node hardware visualizations have mostly targeted memory address space and resource usage.

5.2.1. Processor Topology

Processor-based visualizations typically visually encode cumulative performance data per-processor. Often, the layout of processors is based on the hardware numbering and data is represented with histograms or stacked bar charts [BD01, ABF∗10]. Schulz et al. [SLB∗11] arranged processors based on the 2D layout of the application (see Section 8) and displayed values using color.

Processor topologies are often embedded within larger network visualizations to show processor resources within individual nodes. Haynes et al. [HCR01] depicted nodes with aggregated glyphs representing multiple processors on each node. Similarly, Zhou and Summers [ZSC03] showed each node as a subdivided grid with cells representing individual processors.

5.2.2. Memory Topology

Several on-node memory visualizations represent the virtual address space of memory as an infinite one-dimensional space. A program is allocated a finite subset of that space by the operating system, and all program variables lie within it. As such, many techniques [GT89, MT07, CFA∗06] depict the space of a single program with variables as finite contiguous blocks within the program’s memory. Griswold et al. [GT89] color-code different variables and their datatypes on a line which wraps down multiple rows to more effectively utilize screen space. Moreta and Telea [MT07] expand the 1-dimensional layout to depict allocations and deallocations over time, with the address space on the vertical axis and time on the horizontal axis. They also include an overview

Other memory visualization techniques have focused on depicting properties of the memory hierarchy, e.g. multi-level caches, RAM, and disk, rather than the address space.

Alpern [ACS90] created an early visualization showing the memory hierarchy of various hardware for the purpose of observing data migration between disk, memory, translation lookaside buffer, and registers. The visualization showed the different memory resources as boxes connected by drawn links and also drew subsets of the data within the memory resource in which they resided. While it did not embed performance data, it created a model for understanding what occurs in hardware and how cache-optimized algorithms more efficiently utilize memory resources. hwloc [BCOM∗10], a software package for the analysis of system attributes, provides a tool called lstopo that detects and displays the topology of different architectures in a hierarchical space-filling layout (Figure 4) but, like the work of Alpern, does not encode performance data.

Choudhury et al. [CPP08] created an interactive visualization depicting simulated memory access data embedded within diagrams of the caches, address space, and iterations. This visualization shows accesses and misses from individual cache lines and addresses. While highly detailed, it would be infeasible to scale beyond the demonstrated number of memory resources. Rivet [BD01] displays another diagram-like visualization of different caches with per-processor memory performance data mapped to cache resources.

Choudhury and Rosen [CR11] created a more abstract representation using a radial space-filling layout, also for simulated data. They represent different levels of cache as rings around a central processor, with lower levels closer to the processor, and depict data migration between levels of cache as lines between ring segments, as seen in Figure 5. Mu et al. [MTSM03] created a visualization targeting NUMA effects on multi-socket nodes by depicting different NUMA domains and transactions between them. The visualization also includes sufficient information to map areas of NUMA transactions to source lines of code. Because they focus on a specific performance issue, the visualization is able to depict a small amount of information in a way that is directly useful in optimizing code for NUMA efficiency.

Rosen [Ros13] specifically targets the memory topology of NVIDIA graphics processors and creates a visual model depicting both processor and memory layout. The visualization decomposes performance data of multiple processing units (warps) to find representative subsets with which to compare. The visualized warps include information about memory banks used by each warp for the purpose of identifying bank conflicts, which represent major memory access bottlenecks. Using representative subsets is an effective way to handle the plethora of performance data and the very large processor topologies while retaining information about average behavior and outliers.

© The Eurographics Association 2014.
Figure 5: Memory hierarchy visualization by [CR11]. Left: Addresses are represented as points, and different sets of points represent different memory resources. The outer ring represents main memory, the outer four arcs of points represent L2 cache, and the inner two arcs represent L1 cache. Lines between points denote migration of data between resources. Right: Simulated transactions are associated with the lines of code which caused them. Image courtesy of P. Rosen.

6. Software Visualization

We survey software visualization only as it relates to performance visualization. Therefore, software visualization techniques applied for other purposes, such as education or software maintenance, are outside the scope of this work. As mentioned earlier, we define the software context as visualizations related to a program’s source code. This includes visualization of the software structures in terms of classes and packages, visualization of the code itself, serial traces of events related to method invocation, and call graphs of specific executions. Data-structure visualization tools are mainly used for education and debugging, although Heapvis [AKG∗10] offers features that could be used for performance visualization.

6.1. Serial Trace Visualization

Serial trace visualization shows a sequence of events. Several different visualization metaphors have been used for the visualization of traces. One of them is a variation of icicle plots, where width encodes duration [RR99, TBD10, KTD13]. Figure 10 shows one such example; the icicle plot is located at the top of the screenshot.
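The geometry of an icicle plot whose width encodes duration can be computed directly from entry and exit events. The sketch below is an assumption-laden illustration (the event format and the name `icicle_rects` are hypothetical): each call becomes a rectangle placed at its call depth, starting at its entry time, with a width equal to its duration.

```python
def icicle_rects(events):
    """Return (name, depth, start, duration) for every call in the trace.

    events: list of (timestamp, kind, name) sorted by timestamp,
            where kind is "enter" or "exit" and calls are properly nested.
    """
    rects = []
    stack = []
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append((name, timestamp))
        else:
            open_name, start = stack.pop()
            # After popping, the stack length is the depth of the closed call.
            rects.append((open_name, len(stack), start, timestamp - start))
    # Order by row (depth), then left to right (start time).
    return sorted(rects, key=lambda r: (r[1], r[2]))

events = [
    (0, "enter", "main"),
    (1, "enter", "solve"),
    (4, "exit", "solve"),
    (5, "enter", "write"),
    (7, "exit", "write"),
    (8, "exit", "main"),
]
rects = icicle_rects(events)
```

A drawing routine would map `start` and `duration` to horizontal position and width, and `depth` to the row, so that callees appear directly beneath their callers.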

A common practice in trace visualizations is to assign one of the axes to the time variable while the other axis is used to represent different processes, classes, instructions or methods [JSB97, DPH10, CHZ∗07, MHJ91, MSM∗11, RZ05]. In essence, most of these approaches represent different variations of Gantt charts. Some trace visualization tools animate the execution of events and even provide additional views for visualizing algorithms [JSB97, BBH08].

In contrast to the presented approaches, Cornelissen et al. [CHZ∗07] place the methods in a circular layout while the edges (events) that represent method calls are bundled to avoid clutter (cf. Figure 6). They provide an additional linked view where different methods are placed on top of the view, calls between them are shown with horizontal lines, and time is shown on the y-axis. In somewhat similar fashion, De Pauw and Heisig [DPH10] use the vertical axis for encoding time, while the horizontal axis encodes different processes. Within each process column, events are represented as blocks color-coded according to different software components. The horizontal position within a process column denotes the load module where the calls were generated.

The trace visualizations thus far are usually viewed in fractions of the total duration. Wu et al. [WYH10] create dot plots of the entire trace versus itself, marking where the events are the same. Additional information is encoded as “bar codes” along the axes. The dot plot shows global patterns along the full timespan of the trace. The same method can be used to compare two different traces. Sambasivan et al. [SSMG13] focus specifically on comparing two request-flow traces with a side-by-side view, difference view, and animation between them.
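The core of a dot plot is a pairwise comparison of event sequences. The following sketch shows the basic idea under the assumption that a trace is reduced to a list of event names; it is not the implementation of Wu et al., and the name `dot_plot` is hypothetical.

```python
def dot_plot(trace_a, trace_b):
    """Return a 0/1 matrix with a 1 wherever the two events are the same."""
    return [[1 if a == b else 0 for b in trace_b] for a in trace_a]

# Hypothetical trace with a repeating step/sync pattern.
trace = ["init", "step", "sync", "step", "sync"]
plot = dot_plot(trace, trace)
```

Comparing a trace with itself always marks the main diagonal. Repetition shows up as additional diagonal runs parallel to it, such as the cells `plot[1][3]` and `plot[2][4]` here. Passing two different traces to the same function yields a comparison between them.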

To cope with the size of the data or to gain insight into the possible branching of events, traces are aggregated into call graphs as described in [JSB97].

6.2. Call Graph Visualization

Together with serial trace views, call graph views appear to be one of the most common visualizations in the software context for performance data analysis. In most cases, call graphs are tree structures, such as context call trees that are usually produced by profilers to help understand caller-callee relationships. Here, one should keep in mind not to confuse the debugging goal with the performance optimization goal. Call graph visualization for performance analysis purposes usually encodes additional performance data in itself or is shown together with other contexts such as tasks or hardware.
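One simple way to obtain a call graph from a trace is to count caller-callee pairs while replaying entry and exit events. The sketch below makes that aggregation concrete. The event format and the name `call_graph` are assumptions for illustration, and the surveyed tools may aggregate differently (for example, by full calling context rather than by function name).

```python
from collections import defaultdict

def call_graph(events):
    """Count caller -> callee edges from a list of (kind, name) events."""
    edges = defaultdict(int)
    stack = []
    for kind, name in events:
        if kind == "enter":
            if stack:
                # The function on top of the stack is the caller.
                edges[(stack[-1], name)] += 1
            stack.append(name)
        else:
            stack.pop()
    return dict(edges)

# Hypothetical trace: main calls solve (which calls dot twice) and write.
events = [
    ("enter", "main"),
    ("enter", "solve"),
    ("enter", "dot"), ("exit", "dot"),
    ("enter", "dot"), ("exit", "dot"),
    ("exit", "solve"),
    ("enter", "write"), ("exit", "write"),
    ("exit", "main"),
]
edges = call_graph(events)
```

Ten events collapse into three weighted edges, which shows how aggregation reduces data size at the cost of event ordering.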

The most common representation of call graphs uses the node-link metaphor, where the node is usually a function (method) and the link represents a function call. In this regard, there are several tools that use an indented tree layout for visualizing the call trees [ABF∗10, MW03, WG11]. Some of these tools integrate performance data directly into the nodes by color-coding them [MW03]. Others use the horizontal space provided by the indented layout to add tabular data or even small barcharts or histograms. They may also employ computational methods to find hot-paths, at which point the corresponding branch would expand and direct users’ attention to the relevant portion of the tree [ABF∗10]. However, due to the size and complexity of the call graphs, performance and statistical data is usually visualized using multiple coordinated views.
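
A hot-path search of this kind could look like the following sketch. The rule used here (descend into the costliest child while it accounts for at least half of its parent's inclusive cost) is an assumption chosen for illustration, as are the `hot_path` name, the `threshold` parameter, and the tuple-based tree. It should not be read as the exact heuristic of any cited tool.

```python
from typing import List, Tuple

# A node is (name, inclusive_cost, children); children is a list of nodes.
Node = Tuple[str, float, list]


def hot_path(node: Node, threshold: float = 0.5) -> List[str]:
    """Follow the costliest child while it holds at least `threshold` of its parent's cost."""
    name, cost, children = node
    path = [name]
    while children:
        hottest = max(children, key=lambda child: child[1])
        if cost <= 0 or hottest[1] < threshold * cost:
            break  # cost is spread across children; stop expanding here
        name, cost, children = hottest
        path.append(name)
    return path


# Hypothetical call tree with inclusive costs in arbitrary units.
tree: Node = ("main", 100.0, [
    ("init", 5.0, []),
    ("solve", 90.0, [
        ("assemble", 20.0, []),
        ("factor", 65.0, [
            ("pivot", 10.0, []),
            ("update", 20.0, []),
        ]),
    ]),
])
print(hot_path(tree))
```

An indented tree view would expand exactly the nodes on the returned path and leave the remaining branches collapsed.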

Other node-link layouts mainly use conventional tree drawing algorithms [SM06, DPH10, Rei90, LTOB10, WKT04, AdSL∗09]. Usually some data is visualized using the color, shape or size of the nodes. For instance, DeRose et al. [DHJ07] managed to integrate load balancing data inside the nodes of the call tree by using the width and the height of the nodes as well as by integrating small barcharts inside the nodes.

Space-filling approaches such as treemaps [WKT04] and sunbursts [AH10] have also been used to represent call trees. Additionally, Adamoli et al. [AH10] present a view where a dissimilarity matrix is used to compare several calling contexts.

6.3. Code and Code Structure Visualization

Sometimes it is important to inspect the specific lines of code where a potential performance problem is detected. Many tools that employ call graph visualization show the code as well, so that when users click on a specific node in the graph, the corresponding line of code is shown or highlighted in the code view [ABF∗10, SG93].

However, there are also approaches that visualize the code and the performance data together. One of the first such examples is the Seesoft tool [ES92]. Here, each line of code is represented by a line of pixels and color-coded according to the number of executions, providing the user with an easy way to notice “hot spots.” A similar idea is presented by Liao et al. [LDB∗99], where each code character is encoded as a pixel and the color denotes various cycles in the code. The aim of this visualization is to improve parallelization.

There are approaches to show specific parts of code in other contexts as well. TraceVis offers functionality where the user can specify regions of static code [RZ05]. It will then color-code the background regions of the trace view, showing which static code elements map to the dynamic trace data. It is also possible to select a specific region in the dynamic trace view and automatically color-code all static instructions.

In some cases certain structural features of the code, such as class hierarchy, should be analyzed in the context of the performance data in order to understand if a potential problem is originating from the application code or an external library. One approach is to visualize software modules or class hierarchy. Icicle plots could be used in this case as well [CHZ∗07]. Figure 6 shows the use of icicle plots in a radial layout. SynchroVis shows program traces in the static structure of the program, visualized using a three-dimensional city metaphor [WWF∗13]. Here different features of the city are mapped to code structures. For example, districts represent packages while buildings represent classes. This work is conceptually similar to the previous two-dimensional representation approach [JSB97]. One of the most straightforward methods to map different software components or modules in other contexts is color. For instance, different parts of a call tree can be color-coded according to the component they belong to [AH10, LTOB10].

7. Tasks Visualization

The fundamental context required by tasks visualizations is the attribution of the performance data to the abstract actors that generated it. These actors include processes, threads, and jobs. Further context in this area includes the hierarchical structure of the actors (e.g. which threads belong to which process). Some tasks visualizations are able to take advantage of other contexts, such as the specific nodes or sites where the process is being run.

Execution traces and system logs are often recorded with tasks context. These documents capture timestamped events such as function entry, message receives, and job initiation. Traces and logs offer analysts a full record of what occurred, but this increases the difficulty of making sense of them. Ordering of events can unveil bottlenecks, delays, and anomalies. Patterns in utilization and communication can be found over the duration of the data collection. The time component of this data is essential in this analysis, so there are many visualizations that attempt to display the time streams per task. We discuss these in Section 7.1.

However, time is not a necessary component in tasks visualizations. We discuss visualizations that do not depend on time in Section 7.2.

7.1. Time in Tasks Visualizations

The majority of time-based tasks visualizations assign time to the $x$ or $y$ axis and then constrain the events of each task to bars in a row or column respectively, similar to a Gantt chart. These visualizations generally omit the call stack information found in single task trace visualizations (Section 6) as available space for the parallel tasks is already a challenge. A typical visualization of this type would be Vampir [NAW*96] as shown in Figure 7. We refer to these as timelines and classify them and closely related ones in Section 7.1.1. Other techniques such as animation and sonification are discussed in Section 7.1.2.

7.1.1. Task Timelines

We classify task timeline visualizations by their representation of both time and the relationships between individual timelines. Most visualizations use physical time (e.g. wall-clock time, system time, and cycle counts), which is generally what is recorded in traces and logs. However, some visualizations support logical time, a partial ordering based on dependency information, often Lamport clocks [Lam78]. Cuny et al. [CHK92] claim that logical time is needed for debugging the correctness of parallel programs, while physical time is more important for performance where the ultimate goal is decreasing the total time required by the program. PARADE [KS98] supports phase time which is a partial ordering of phases of an execution rather than individual events. However, computing phases from trace data without extra information is difficult.
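
To make the notion of logical time concrete, the sketch below assigns Lamport timestamps to a small trace. It is a minimal example under stated assumptions: events are `(process, kind, message_id)` tuples listed in an order where every send precedes its matching receive, and the function name `lamport_timestamps` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# An event is (process, kind, message_id); message_id is None for local events.
Event = Tuple[int, str, Optional[str]]


def lamport_timestamps(events: List[Event]) -> List[int]:
    """Assign a Lamport clock value to each event, in the order given."""
    clock: Dict[int, int] = {}   # latest timestamp per process
    sent_at: Dict[str, int] = {}  # timestamp carried by each message
    stamps: List[int] = []
    for process, kind, message in events:
        time = clock.get(process, 0) + 1
        if kind == "recv":
            # A receive must be ordered after the matching send.
            time = max(time, sent_at[message] + 1)
        clock[process] = time
        if kind == "send":
            sent_at[message] = time
        stamps.append(time)
    return stamps


# Hypothetical two-process trace.
events: List[Event] = [
    (0, "send", "m1"),
    (1, "local", None),
    (1, "local", None),
    (1, "recv", "m1"),
    (1, "send", "m2"),
    (0, "recv", "m2"),
]
print(lamport_timestamps(events))
```

Process 0 jumps from logical time 1 to 5 when it receives `m2`. A logical-time timeline would place that receive to the right of everything process 1 did beforehand, however much wall-clock time actually elapsed.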

Some visualizations show no relationships between timelines [DPH10, SG93, LSV*89, TBD10, Sha90, Rei90]. Zinsight [DPH10] recognizes a hierarchy of tasks and allows users to select which granularity to plot events. While Vampir’s default view shows messages between timelines, it also provides cluster timelines which show aggregated events over the cluster with no messages [VMa13].

SIEVE [SG93] draws contour lines across the tasks where the events are equivalent as shown in Figure 8. Muelder et al. [MGM09] show log-scale duration versus time, rather than placing tasks or groups of tasks on the $y$-axis. Instead, events from all tasks were drawn over the same area and overplotting and blending techniques were used to show consensus (or lack thereof) among tasks.

In some cases, intertimeline relationship data may not exist. HPCToolkit [ABF*10] visualizes sampled data rather than full traces. It shows all tasks in an information mural style display, sampling each pixel for its task and sample contributors. Individual tasks can be selected for a detailed single timeline display.

De Pauw et al. [DPWB13], SeeLog [EL96], and lviz [WYH10] show separate program instances, which unlike processes do not interact directly. The De Pauw et al. visualization (Figure 9) displays job lifetimes on a shared system in an online stacked graph-like visualization that groups jobs by user. Rather than assigning rows to jobs permanently, De Pauw et al. change the $y$ value over time so the clusters remain contiguous and separate from each other. SeeLog shows classes of applications per row with glyphs indicating how many are active rather than bars. lviz visualizes job logs on Microsoft Windows in a dot plot, revealing repeated patterns of jobs over time. The dot plot can be used to compare two separate logs by assigning one log to rows and another to columns.

Timelines may affect each other through dependency constraints such as message sends and receives or access to shared objects. Message dependencies are often shown as a line or arrow from the send on one timeline to the receive on another [YSM95, SKV03, ZLGS99, LMCF90, FB89, dKSB00, KS98, HE91, PLCG95, KZLK06, TSS98, KTD13, SHN10, SRWS99, KTM97, KG96]. Including these types of dependencies makes it possible to highlight critical paths. Virtue [SRWS99] draws timelines in 3-dimensional space, using a ring layout for tasks rather than an axis. The visualization is also compatible with a CAVE environment. VisualLinda [KTM97] and Triva [SHN10] use 3D in order to cluster tasks by their location on physical processors in two of the dimensions.

SyncTrace [KTD13], shown in Figure 10, draws a serial timeline overview for a selected thread and a focus view which draws multiple threads as sectors of a circle. The call stack is maintained for these threads, resulting in a sunburst-like design. Relationships between the focus thread and other threads are drawn as aggregated edges, similar to a chord diagram.

Lifetime relationships, such as the creation and destruction of tasks, are also dependencies, but we separate them because they involve the addition or removal of tasks in the visualization. Many of the visualizations supporting lifetime relationships also support message or shared object dependencies.

DOTS [BKS05] uses a Sugiyama-style layout algorithm to assign threads to columns and route dependency lines. The threads are grouped by processor. ThreadScope [WT10] also uses a layered node-link diagram with line styles representing different relationships and node styles representing both threads and memory. The graph can be condensed through grouping by malloc calls or classes.

Several visualizations represent parent-child relationships among tasks but do not emphasize creation and destruction; instead the space is allocated to the task far past the extent of its lifetime. We do not consider these as showing lifetime relationships because it is not clear if the task is non-existent or just idle. Wang and Kunz [WK00] modify the usual timeline view to show the lifetimes of migratable objects as they move between individual timelines representing machines in distributed systems.

Table 1 organizes the task timeline visualizations by what relationships are present between the individual timelines and what type of time is displayed.

7.1.2. Other Time-based Tasks Visualizations

Several visualizations use animation to represent time, showing the state of the tasks at every instant. VISTOP [BB92] uses a mailbox metaphor to show messaging and semaphore activity and a directory to show thread spawning relationships. SynchroVis [WWF∗13] uses a city metaphor to represent the static structure of the program with special buildings where added floors represent thread and shared object creation. Arrows connecting to the special buildings show the evolution of the system in time.

Belvedere [HC88] and its follow-up Ariadne [CFH∗93] animate messages between processes in logical time for debugging. Streamsight [DPA09] creates a node-link diagram with processing elements as nodes and streams between them as links. Grouping by job or host makes aggregation and clutter reduction possible. This visualization allows for real-time monitoring but can also be recorded and replayed.

Sigovan et al. [SMM13b] animate events as rising bubbles per process which fade into the background at the end of their duration, creating a contextual history. Using overplotting and blending techniques, this animation is able to scale to 16K processes.

PARADE [KS98] and PVaniM [TSS98] place processes on a circle and animate messages moving between them. Growing Squares [ET03] similarly places processors when animating dependency relationships between processes in logical time. Process squares ‘grow’ outlines that incorporate the colors of other processes that have causally affected them.
<table>
<thead>
<tr>
<th>Relationships</th>
<th>Time</th>
<th>Visualizations</th>
</tr>
</thead>
<tbody>
<tr>
<td>None</td>
<td>Physical</td>
<td>ConcurrencyVisualizer [GN10], De Pauw et al. [DPWB13], Devise [KMLM97], Falcon [GEK∗95], HPCToolkit [ABF∗10], lviz [WYH10], Muelder et al. [MGM09], PIE [LSV∗89], Reilly [Rei90], SeeLog [EL96], Sharma [Sha90], SIEVE [SG93], Trümper et al. [TBD10], Vampir [NAW∗96], Zinsight [DPH10]</td>
</tr>
<tr>
<td>Dependency</td>
<td>Physical</td>
<td>AIMS [YSM95], Jumpshot [ZLGS99], Moviola [LMCF90, FB89], Pajé [dKSB00], PARADE [KS98], ParaGraph [HE91], PARAVER [PLCG95], Projections [KZLK06], PVaniM [TSS98], SyncTrace [KTD13], Triva [SHN10], Virtue [SRWS99], VisualLinda [KTM97], XPVM [KG96]</td>
</tr>
<tr>
<td>Lifetime</td>
<td>Logical</td>
<td>Concurrency Maps [Sto88], DeWiz [SKV03], Moviola [LMCF90, FB89]</td>
</tr>
</tbody>
</table>

Table 1: Classification of Tasks Timelines

Figure 11: Snapshot from the animated trace visualization by Sigovan et al. [SMM13b]. As events persist, they rise upward logarithmically. Image courtesy of C. Sigovan.

Sonification methods have also been used. Francioni et al. [FAJ91] and Pablo [RRA∗93] map tasks to separate instruments or tones, having them sound for the duration of particular events or other states of interest (e.g. idleness).

7.2. Visualization of Non-Time Tasks

Several visualizations show the process communication graph, a summary of all messages among processes in some time frame. Adjacency matrices are frequently used [HE91, RRA∗93, VMa13]. Bhatele et al. [BGI∗12] modify a node-link diagram to aggregate processes with similar delay behavior into arcs.
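
The adjacency matrix behind such a view is straightforward to build. The following sketch assumes messages are available as `(sender, receiver, size_in_bytes)` tuples with zero-based process ranks. The function name `communication_matrix` and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple


def communication_matrix(
    messages: Iterable[Tuple[int, int, int]], num_processes: int
) -> List[List[int]]:
    """Sum bytes sent from each process (row) to each process (column)."""
    matrix = [[0] * num_processes for _ in range(num_processes)]
    for sender, receiver, size in messages:
        matrix[sender][receiver] += size
    return matrix


# Hypothetical ring exchange among four processes, with one extra message.
messages = [(0, 1, 64), (1, 2, 64), (2, 3, 64), (3, 0, 64), (0, 1, 32)]
matrix = communication_matrix(messages, 4)
for row in matrix:
    print(row)
```

Each cell would then be mapped to a color. A ring pattern such as this one appears as a band just above the diagonal plus a single wrap-around cell in the lower-left corner.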

Kim et al.’s [KLJ07] method represents threads as points on a cone, with the distance from the apex indicating creation depth. Threads can be aggregated to reduce clutter.

ParaProf [SML∗12] colors tasks by user-chosen metrics and provides a scripting language so the user can decide the task layout.

8. Application Visualization

Application contexts are specific to the problems that their target programs are attempting to solve. This generally corresponds to the perceived space of the input and output of the program. For example, the application context of a matrix multiplication program is the space of the matrices involved in the operation.

ParaGraph [HE91] includes facilities to generate performance displays in the application context, noting that such displays could provide new detail and insight, but also mentioning that such displays are highly non-trivial and application specific. They show an example of data transaction counts of a matrix operation overlaid onto the input and output matrices.

A similar visualization involving parallel prefix sums was shown by Stasko and Kraemer [SK93]. They created an animation showing different processors operating on different parts of the input dataset. This visualization proved useful in debugging parallelism issues in the prefix code and provided a stronger understanding of the utilized parallelism.

Schulz et al. [SLB∗11] observed that application developers find the application context highly intuitive. They created visualizations which successfully uncover application-specific performance bottlenecks, by arranging processors and their generated counter data by the physical regions within the bounded fluid dynamics simulation they computed. The visualization showed that areas of high computational and bandwidth costs occurred in areas of high fluid turbulence, as seen in Figure 12. Schulz et al. also observed a need to complement the application context view with other contexts, such as hardware and communication, noting that problems become more obvious when projected into the context from which they originate.

Wylie and Geimer [WG11] likewise created a visualization with processor computation time attributed to physical regions in a large scale reservoir simulation. They also observed application-specific performance bottlenecks where certain areas of the dataset incurred larger computational overhead.

The aforementioned applications generally have intuitive contexts in 2- or 3-dimensional space, but it is a challenge to depict the application context of programs with high-dimensional or abstract output. Furthermore, application visualizations have so far required significant implementation effort by the analyst; ideally, analysts could instead leverage existing visualization tools, which are designed for simulation output rather than performance data.

9. Challenges

In this section we discuss challenges in performance visualization. Many of the challenges highlight a need for close and continued collaborations with domain experts. The challenges of parallel scale, system complexity, and attribution require expert input to craft useful and informative visualizations. Experts in HPC are not strangers to issues of data scale, and it is highly beneficial to harness this experience in an effort to create scalable and useful visualizations. Finally, the integration of new visualizations with data collection or performance workflows necessitates sustained partnerships between communities.

9.1. Scale

As the scale of computing resources continues to grow exponentially, and along with it, the scale of the collected performance data, it is becoming increasingly critical to create highly scalable performance visualizations. There are two major scale challenges facing performance visualization: parallel scale and data scale. Parallel scale refers to the number of elements required by the context that the visualization is attempting to represent simultaneously. This includes nodes, cores, and memory addresses in the hardware context and tasks, processes, threads, and jobs in the tasks context. Data scale refers to the amount of data collected that need not necessarily be displayed all at once, but must be processed by the visualization. There are several ways to think about data scale – file size, execution time, and total number of samples or events. Figure 1 shows the reported parallel and data scales of the most recent performance visualizations.

As Figure 1 shows, few of the visualization methods cited demonstrated an ability to handle tens of thousands of simultaneous tasks, and some of those that do handle that scale only for statistical plots, not for more sophisticated views. Others simply average across pixels, which may hide the insights users seek. At the same time, requiring users to pan extensively within detailed visualizations is not reasonable. While some tools may scale, the utility of the visualization does not. Creating sweet spots between full aggregation and largely unprocessed detail so that the necessary contexts are still shown remains a challenge.
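
A small numeric sketch illustrates how averaging across pixels can hide an outlier. The setup is hypothetical: sixteen tasks with one slow task are drawn into two pixel rows, and the `bin_tasks` helper is introduced only for this illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence


def bin_tasks(
    values: Sequence[float],
    num_pixels: int,
    reducer: Callable[[Sequence[float]], float],
) -> List[float]:
    """Reduce one value per task to one value per pixel using `reducer`."""
    if not 0 < num_pixels <= len(values):
        raise ValueError("num_pixels must be between 1 and len(values)")
    count = len(values)
    bins = []
    for pixel in range(num_pixels):
        # Integer arithmetic gives every pixel a contiguous, non-empty slice.
        start = pixel * count // num_pixels
        end = (pixel + 1) * count // num_pixels
        bins.append(reducer(values[start:end]))
    return bins


# Hypothetical wait times: fifteen tasks wait 1.0 units, task 5 waits 9.0.
wait_times = [1.0] * 16
wait_times[5] = 9.0

averaged = bin_tasks(wait_times, 2, mean)
peaks = bin_tasks(wait_times, 2, max)
print(averaged)  # [2.0, 1.0]
print(peaks)     # [9.0, 1.0]
```

With averaging, the slow task barely changes the first pixel, while taking the maximum preserves it at the cost of hiding how many tasks were affected. Neither reducer is correct in general, which is one way to see why intermediate levels of aggregation remain an open problem.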

As the size of acquired performance data increases, it has become necessary to scale not only the visualization, but the underlying data. Though this problem has been often neglected by the visualization community, tools developed within the performance community have begun to address the issue. HPCToolkit [TMCF*11] maintains interactivity of its views by sampling the data rather than reprocessing all of it during panning and zooming. It reduces the size of data during collection through sampling as well. Vampir [ISC*12] can utilize the same systems it is meant to analyze, handling terabytes of data via the parallel filesystem and an allotment of processors. It can also employ data reduction techniques during collection. As the data size increases, it may not be feasible to save the entirety of the collected data, so integrating more approaches like the parallel system usage of Vampir or the sampling-based functionality of HPCToolkit is crucial. The greater use of sampling and the effects of overhead and clock synchronization necessitate more techniques for handling uncertainty.

9.2. System Complexity

Many of the continuing challenges in performance visualization are the product of the ever-evolving technology in high performance computing systems. Network topologies are constantly changing and increasing in dimensionality to improve parallelism and efficiency, and as a result existing techniques may quickly become obsolete. Previous networks had natural embeddings into 2- or 3-dimensional space but this is no longer the case for the largest systems. New layouts are needed that can leverage the inherent structure of these networks and improve developers’ understanding of them. Furthermore, as hardware developers expose new ways to capture different performance events, visualizations must adapt to fully utilize the new and changing performance data and contexts.

9.3. Ensemble Runs

Analysts often must make comparisons among different executions of the same application to determine the most likely causes of performance differences or to validate the performance benefits of changes in algorithms or parameters. Few visualizations we surveyed had support for handling ensemble datasets. Those that did were limited to, or demonstrated on, only a few executions at a time [BW12, WYH10, TMCF∗11]. Instead, users generally compare two executions by examining visualizations of each one individually. This is an area where visualization can help reduce the cognitive load on the analyst. Showing differences can be tricky: even in the two-run case there are issues of normalizing metrics and resolving multiple corresponding entities.

9.4. Coordination

The current state of performance visualization software is scattered amongst a variety of tools and techniques serving different purposes. As different techniques are able to accomplish different subsets of the tasks delineated in Section 3 and no individual technique accomplishes all of them, this chaotic state is largely unavoidable. While we presented the relevant research in four context categories, we have observed that the distinction between contexts is not always well-defined and as such visualizations need not be constrained to any single one. Many of the studies we surveyed support the usefulness of combining multiple techniques, whether closely tied together in the form of linked views [VMa13] or simply applied separately to the same target program [KZLK06]. One of the main challenges in performance visualization therefore is the development of improved integrations of multiple views and performance data in intuitive ways.

9.5. Attribution

While complex visualizations can elucidate novel or interesting performance data, it is important to keep in mind that the visualization has to aid the developer in accomplishing some set of performance goals. Many of the examples in Figure 1 reflect case studies that rely heavily on expertise or insight from the user. Most solutions that handle attribution directly do so at the function call or line-of-code level. The area of performance goals targeted least by the surveyed papers was attributing performance problems to semantically high-level reasons and determining possible avenues of improving the code.

9.6. Evaluation

When dealing with more complex systems and programs, especially in the high-performance computing field, the number of domain experts capable of participating in user studies is small. Therefore, full-fledged usability evaluations or controlled experiments with large numbers of participants are rare (for example [SSMG13]). However, variations of expert evaluations can be performed [TM05] as was done in De Pauw and Heisig [DPH10]. These require a small number of domain and visualization experts, making them more feasible to conduct. The surveyed papers have in general not studied the usability of performance visualization methods and interfaces. Expert evaluations with the inclusion of usability would further help to fill this gap in knowledge.

10. Conclusions

Performance visualization is a growing field which continues to adapt to the expanding ecosystem of high performance computing. As supercomputers become more powerful, increasing effort is required to understand how different software runs on such machines and to optimize its performance. Rising complexity of systems and performance data collected on them invites the utilization of visualization and analysis tools. Largely driven by necessity, performance visualization presents new and challenging research questions, many of which remain to be answered.

We have presented a survey of existing approaches in performance visualization. The current work has been organized based on the primary contexts in which the data has been visualized. Moreover, we have presented and categorized the goals that domain experts seek to address through visualization. Finally, we have discussed the existing challenges in this domain. This survey should act as an introduction to the state of the art for information visualization experts seeking to apply their knowledge to new domains. It also may aid HPC professionals in exploring new tools and methods to analyze their data.

11. Acknowledgements

We thank the participants of the Dagstuhl Perspectives Workshop 14022 “Connecting Performance Analysis and Visualization to Advance Extreme Scale Computing” for their constructive discussions that inspired parts of this paper.

This work was supported in part by the University of California Laboratory Fees Research Grant Program and the Department of Energy Office of Science Graduate Fellowship Program (DOE SCGF) administered by ORISE-ORAU under contract no. DE-AC05-06OR23100. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-CONF-652873.

References


[BD01] BOSCH R., DEPT S. U. C. S.: Using visualization to understand the behavior of computer systems. Stanford University, 2001. 6, 7, 15


[dKSB00] DE KERGOMMEAUX J. C., STEIN B., BERNARD P.: Pajé, an interactive visualization tool for tuning multi-threaded parallel applications. Parallel Computing 26, 10 (2000), 1253–1274. doi:10.1016/S0167-8191(00)00100-7. 6, 11, 12



Biography

Katherine E. Isaacs is a third year computer science Ph.D. student at the University of California, Davis researching information visualization techniques for performance analysis. In 2012 she was awarded a Department of Energy Office of Science Graduate Fellowship (DOE SCGF). She completed a B.S. in computer science and a B.A. in mathematics at San José State University and a B.S. in physics at the California Institute of Technology.

Alfredo Giménez is a third year PhD student at the University of California at Davis. His research focuses on information visualization for performance analysis on HPC systems. He received a B.S. in Computer Science at the University of California at Davis in 2010 and worked for 2 years developing performance optimization tools for graphics hardware at Intel Corporation.

Ilir Jusufi is a Postdoctoral Scholar at the University of California, Davis. His research focuses on visualization of performance analysis data for HPC. He received a B.S. in Computer Science at the South East European University in Macedonia and a M.S. in Computer Science at the Växjö university in Sweden. He earned his Ph.D. degree at the Linnaeus University in Sweden focusing on the visualization and interaction techniques of multivariate networks.

Todd Gamblin is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research focuses mainly on scalable algorithms for measuring, analyzing, and visualizing performance data from massively parallel applications. He is also interested in fault tolerance, resilience, MPI, and parallel programming models. Todd has been at LLNL since 2008.

Todd works closely with researchers in CASC and with staff in the Development Environment Group in Livermore Computing. He is the team leader for the Performance Analysis and Visualization at Exascale (PAVE) project, and he also works on the Exascale Computing Technologies LDRD project, the SciDAC Sustained Performance, Energy, and Resilience (SUPER) project, and many other ASC projects at LLNL.

Todd received the Ph.D. and M.S. degrees in Computer Science from the University of North Carolina at Chapel Hill in 2009 and 2005. He received his B.A. in Computer Science and Japanese from Williams College in 2002. He has also worked as a software developer in Tokyo and held graduate research internships at the University of Tokyo and IBM Research.

Abhinav Bhatele is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His interests lie in performance optimizations through analysis, visualization and tuning and developing algorithms for high-end parallel systems. His thesis was on topology aware task mapping and distributed load balancing for parallel applications.

Abhinav received a B. Tech. degree in Computer Science and Engineering from I.I.T. Kanpur, India in May 2005 and M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 2007 and 2010 respectively. Abhinav was an ACM/IEEE-CS George Michael Memorial HPC Fellow in 2009. He has received several awards for his dissertation work including the David J. Kuck Outstanding MS Thesis Award in 2009, a Distinguished Paper Award at Euro-Par 2009 and the David J. Kuck Outstanding PhD Thesis Award in 2011. Recently, a paper that he co-authored with LLNL and external collaborators was selected for a best paper award at IPDPS in 2013.

Martin Schulz is a Computer Scientist at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). He earned his Doctorate in Computer Science in 2001 from the Technische Universität München (Germany) and also holds a Master of Science in Computer Science from the University of Illinois at Urbana-Champaign. He has published over 150 peer-reviewed papers. He is the PI for the Office of Science X-Stack project "Performance Insights for Programmers and Exascale Runtimes" (PIPER) and for the ASC/CCCE project on OpenSpeedShop. Further, he is the chair of the MPI forum, the standardization body for the Message Passing Interface, and is involved in the DOE/Office of Science Exascale Projects CESAR, ExMatEx, and ARGO. Martin’s research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power efficiency for parallel systems; optimizing parallel and distributed I/O; and fault tolerance at the application and system level. In his position at LLNL he especially focuses on the issue of scalability for parallel applications, code correctness tools, and parallel performance analyzers as well as scalable tool infrastructures to support these efforts.

Bernd Hamann is a professor of computer science at the University of California, Davis. He studied mathematics and computer science at the Technical University of Braunschweig, Germany, and received a Ph.D. in computer science from Arizona State University in 1991. His main teaching and research interests are data visualization, data analysis and geometric modeling.

Peer-Timo Bremer is a member of technical staff and project leader at the Center for Applied Scientific Computing (CASC) at the Lawrence Livermore National Laboratory (LLNL) and Associate Director for Research at the Center for Extreme Data Management, Analysis, and Visualization at the University of Utah. His research interests include large scale data analysis, performance analysis and visualization and he recently co-organized a Dagstuhl Perspectives workshop on integrating performance analysis and visualization. Prior to his tenure at CASC, he was a postdoctoral research associate at the University of Illinois, Urbana-Champaign. Peer-Timo earned a Ph.D. in Computer Science at the University of California, Davis in 2004 and a Diploma in Mathematics and Computer Science from the Leibniz University in Hannover, Germany in 2000. He is a member of the IEEE Computer Society and ACM.