I Introduction
In mesh-based numerical simulations, sweep is an important computation pattern widely used in solving the deterministic Boltzmann Transport Equation (BTE) [4], convection-dominated or Navier-Stokes equations [5][6], and so on. During a sweep on a mesh, cells are computed from upwind to downwind in the sweeping direction. One cell can be computed only when all of its upwind neighboring cells have been computed.
General sweep computation on large-scale meshes is challenging. For rectangular structured meshes, where the data dependencies are regular, the well-known Koch-Baker-Alcouffe (KBA) [8][9] algorithm, which uses a pipelined wavefront scheme to parallelize multiple sweeps, has been very successful. However, for the more general deforming structured meshes and unstructured meshes, in which data dependencies among cells are irregular, the KBA method doesn't fit and is typically impossible to apply. Instead, a data-driven approach [15][16][20] is often considered. This approach models the cells' data dependencies as a directed acyclic graph (DAG), regardless of the mesh type; a sweep on the mesh is then equivalent to a topological traversal of the DAG. Unfortunately, although KBA-based sweep on regular structured meshes had scaled to very large numbers of CPU cores and billions of cells by 2015 [22], sweep on ordinary unstructured meshes still doesn't scale efficiently to large core counts.
Meanwhile, mesh-based application programming frameworks [1][2][28][29] have become increasingly critical to today's complex simulations that require coupling multiple multi-physics procedures. On one hand, multiple simulation procedures developed on a unified framework share the same specification of mesh and data layout, and thus work together more consistently. Given that coupling different simulation procedures is difficult for both software development and numerical validation, this consistency is valuable. On the other hand, by providing users a programming abstraction and ensuring reasonable performance, the framework isolates applications from the evolution of underlying HPC systems, and thus achieves good portability. Recently, in areas related to particle transport, in which sweep on the mesh is the most time-consuming portion, framework-based coupling of multi-physics simulations has led to several successful cases, including the full-core reactor simulation based on MOOSE [35] and the ICF (Inertial Confinement Fusion) program LAREDI [34] based on JASMIN [1]. Nevertheless, so far these cases are limited to structured meshes.
In this paper, we focus on the patch-based mesh application framework, specifically JAxMIN [1][2] (detailed in section II-B), where the mesh is divided into patches. A patch is conceptually a subdomain of the mesh. The patch-based approach has advantages in adaptive mesh refinement, mesh management and parallel computation scheduling. Over 50 real-world applications have been implemented on JAxMIN. However, JAxMIN, like most counterparts, adopts the BSP (Bulk Synchronous Parallel) [13] style of parallel computing, in which all subdomains (patches) first compute using the previous data of themselves and other subdomains, and then communicate to update their remote copies. Although efficient and scalable enough for most numerical solvers, BSP is seriously inefficient for data-driven sweep computation, where the parallelism is fine-grained. Furthermore, the fact that a patch often can't finish its computation in one pass and thus has to be computed many times, together with the complex factors impacting performance, makes sweep hard to realize in JAxMIN's BSP-based abstraction.
We propose JSweep, a patch-centric data-driven approach for parallel sweep computation on both structured and unstructured meshes, embedded in the JAxMIN infrastructure. Fig. 1 illustrates the JSweep modules in the abstracted layers of JAxMIN. Specifically, our key contributions are as follows:

The patch-centric data-driven abstraction, a unified model for general data-driven procedures on both structured and unstructured meshes. The core idea is extending the concept of a patch to a logical processing element that supports reentrant computation and communicates with other patches (Sec. III).

The patch-centric data-driven runtime module for contemporary multi-core cluster systems, featuring hybrid parallelism (MPI+threads) and dynamic data delivery (Sec. IV).

A sweep component based on the above patch-centric approach, enhanced by vertex clustering, a multi-level priority strategy, patch-angle parallelism and a coarsened graph (Sec. V).

Experimental evaluation with real particle-transport applications on both structured and unstructured meshes, demonstrating JSweep's reasonable performance and scalability on up to 76,800 processor cores (Sec. VI).
II Background and Motivation
II-A Preliminaries
Throughout this paper, we use a small set of terms as illustrated in Fig. 2. Note that we describe them in an abstract view and don't explicitly differentiate structured and unstructured meshes unless otherwise stated.
mesh/grid: the generic way of describing the discretized domain.
cell: the smallest unit of a discretized domain.
patch: a collection of contiguous cells.
local cells: cells owned by a patch. They are updated by an operator applied to the patch.
ghost cells: halo cells surrounding local cells. They are needed for computation but not updated by the local operator. They belong to other patches.
II-B JAxMIN: Patch-based Mesh Application Infrastructure
JASMIN [1] and JAUMIN [2] (denoted JAxMIN for simplicity, an abbreviation of J Adaptive Structured/Unstructured Mesh INfrastructure) are software infrastructures for programming large-scale numerical simulations on structured and unstructured meshes. They share the same design principles, i.e., patch-based mesh management and a component-based application programming interface. Although they differ in the way of describing mesh geometry and mesh elements, we omit these details by discussing them at an abstract level in this paper.
JAxMIN adopts a patch-based strategy to manage the mesh and data. The computational domain, discretized as a mesh, is decomposed into patches. A patch is a well-defined subdomain: (1) each patch has complete information about its own cells as well as other mesh entities, (2) with ghost cells, each patch can explicitly get all adjacency information about its neighboring patches, and (3) it is abstract enough to hide differences between structured and unstructured meshes.
JAxMIN provides users a rich set of components as the programming interface. A component here is a generic implementation of a computational pattern. To implement a parallel program, users only need to instantiate a component by defining the application-specific computation kernel. So far, general patterns such as initialization, numerical computation, reduction, load rebalancing and particle-in-cell communication are provided. Besides, JAxMIN implements a number of physics-specific integration components.
Traditionally, JAxMIN adopts the BSP model to organize computations in a component. The computations consist of a sequence of iterations, called supersteps. During a superstep, each patch executes logically in parallel, as follows: (1) it computes independently without data exchange with others, and then (2) it performs halo-exchange communication with neighbors using the newly computed data. Since many numerical algorithms fit well in BSP, the patch-based framework has been successful in many areas.
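The two-phase superstep above can be sketched as follows; `bsp_superstep`, `update` and `halo_of` are hypothetical names for illustration, not JAxMIN's actual API:

```python
# Minimal BSP superstep sketch (hypothetical names, not JAxMIN's API):
# every patch computes from the previous state, then all halos are exchanged.

def bsp_superstep(patches, neighbors, halo_of, update):
    """Run one superstep: compute all patches, then exchange halos."""
    # Phase 1: each patch computes independently on its previous data.
    new_state = {p: update(p, patches[p]) for p in patches}
    # Phase 2 (synchronization): refresh every patch's ghost copies
    # from its neighbors' newly computed data.
    ghosts = {p: {q: halo_of(new_state[q]) for q in neighbors[p]}
              for p in patches}
    return new_state, ghosts
```

The barrier between the two phases is implicit here: no patch reads a neighbor's new data until every patch has finished computing.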
JAxMIN implements a high-performance runtime system supporting hybrid MPI+threads parallelism and accelerators, with underlying optimizations on memory management, data layout and communication buffering.
II-C Data-driven Parallel Sweeps
Without loss of generality, we consider the sweep computation in discrete ordinates (Sn) transport solvers. Sweep is the most computationally intensive portion of the source iteration methods solving the Sn form of the Boltzmann Transport Equation [4]. As the name implies, a sweep in any ordinate direction requires a computational ordering of cells from upwind to downwind. One cell can begin computing only after all of its upwind neighboring cells are computed.
Parallelizing sweep computation is challenging since it can't be efficiently implemented in a BSP manner. For regular structured meshes, the KBA approach, decomposing 3D meshes in a 2D columnar fashion and pipelining the computation over successive angles, works well with BSP. However, for the more general deforming structured meshes and unstructured meshes, where data dependencies among cells are irregular and the pipeline can't be easily determined, the KBA approach is almost impossible to implement.
Alternatively, we focus on data-driven parallelization, a general approach for sweeps on both structured and unstructured meshes [15][16]. In this approach, arbitrarily complex and irregular data dependencies can be explicitly modeled by a directed acyclic graph. As an example, Fig. 3 illustrates a 2D unstructured mesh and the associated directed graph in a given sweeping direction. The sweep on the mesh is then equivalent to a topological traversal of the directed graph, generalized with user-defined numerical computations on each vertex.
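As a minimal sketch (not JSweep's implementation), the sweep-as-topological-traversal idea can be expressed with Kahn's algorithm, where `compute` stands in for the user-defined per-cell numerical kernel:

```python
from collections import deque

# Sketch of a data-driven sweep as a topological traversal of the
# dependency DAG (Kahn's algorithm). `compute` is a stand-in for the
# user-defined per-cell kernel; names are illustrative.

def sweep(vertices, edges, compute):
    """edges[u] lists the downwind vertices that depend on u."""
    indegree = {v: 0 for v in vertices}
    for u in vertices:
        for v in edges.get(u, ()):
            indegree[v] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)  # upwind sources
    order = []
    while ready:
        u = ready.popleft()
        compute(u)                       # user-defined numerical work
        order.append(u)
        for v in edges.get(u, ()):       # release downwind neighbors
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order
```

Every vertex is computed exactly once, and only after all of its upwind neighbors, which is precisely the sweep ordering constraint.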
In reality, multiple sweeps in different ordinate directions (angles) are carried out simultaneously. We can model their induced data dependencies in a single graph and implicitly leverage the parallelism of sweeps from all angles.
II-D Motivation
Now we consider the data-driven parallel sweep procedure in the context of the patch-based framework, especially JAxMIN. Unlike other numerical algorithms, patch-level data-driven computation can't be naturally supported in BSP, due to the difficulties described below. These difficulties motivate us to develop a new patch-centric data-driven abstraction in the next section.
II-D1 Partial computation
In data-driven scenarios, partial computation of a patch is essential. As illustrated in Fig. 4, where one mesh is partitioned into two patches, interleaved data dependencies between the patches mean that a patch can't be computed as a whole. In reality, such zigzag data dependencies are common in unstructured meshes. Thus, to be reentrant, partial computation of a patch is necessary.
II-D2 Simultaneous sweeps on a patch
In real-world applications, sweeps from multiple directions are often performed in parallel. For example, in the Sn sweeps illustrated in Fig. 5, one patch would be swept by multiple sweep procedures from 4 different directions. Generally, it is common that some sweeping directions are independent of each other. Thus, to enable such parallelism, simultaneous sweeps on a patch are necessary. In JAxMIN, however, the patch is the basic unit of parallel computation, so we need to extend its abstraction.
II-D3 Priority strategies
Priority strategies are becoming more important and complex. Previous work [15][16] has shown that the ordering of computing cells (or vertices) is often critical to both parallelism and performance. In their settings, since one MPI process or thread handles only one mesh subdomain, it is sufficient to compute priorities of cells (or vertices in the associated graph). However, in the patch-based framework, one process or thread is typically assigned an arbitrary number of patches, which means patch scheduling always takes precedence over cell scheduling within a patch. Thus, we need at least a two-level policy prioritizing both patches and cells.
III Patch-centric Data-driven Abstraction
In this section, we introduce the patch-centric data-driven abstraction for mesh-based parallel computations. Its foundation is the completeness and expressivity of the patch concept in JAxMIN described in section II-B. In our abstraction, the concept of a patch is further extended to a logical processing element able to compute on itself and communicate with any other patches. Users follow a think-like-a-patch philosophy to program, focusing only on the actions of a single patch, i.e., defining the local computation and inter-patch communication. The abstraction doesn't expose any details of the underlying patch execution. It should be suitable for all patch-based mesh application frameworks, including SAMRAI [28], (part-based) PUMI [29] and especially JAxMIN.
III-A Data-driven patch-programs
Data-driven logic on a patch is encoded as a patch-program. The patch-program is identified by a (patch, task) pair, indicating that the task is executed on the patch. Any data communication between two patches is abstracted as a stream. The stream contains the user-defined data and a description of the source and destination patch-programs. Fig. 6 presents the interface of the patch-program and stream, in which the patch-program is factored into five primitive functions.
We define the patch-program to be fully reentrant to support partial computation (detailed in the next subsection). At the beginning, each patch-program is set active. In later execution, the state of a patch-program transits according to the finite state machine given in Fig. 7. If its vote_to_halt function evaluates to true, the patch-program becomes inactive. Once it receives a stream, the patch-program becomes active again. Conceptually, if there are no active patch-programs globally, the whole program terminates.
The patch-program, identified by its (patch, task) pair, is scheduled to run with the semantics in Alg. 1, as follows.
If it runs for the first time, the init function is used to initialize a local context.
It receives all streams sent to it by others, which are processed by the user-defined input function.
It calls the compute function with user-defined numerical kernels.
It sends all output streams to their target patch-programs and activates them.
It calls vote_to_halt to evaluate whether there remains ready work to do. If not, it deactivates itself; otherwise, it keeps active for being scheduled again.
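The steps above can be sketched as a driver loop; `run_engine` and the program object's attributes are schematic stand-ins for illustration, not JAxMIN's actual interface:

```python
# Sketch of the patch-program scheduling semantics (Alg. 1).
# `run_engine`, the program objects and their five primitives
# (init/input/compute/output/vote_to_halt) are illustrative stand-ins.

def run_engine(programs, inbox):
    """programs: id -> program object; inbox: id -> list of pending streams."""
    active = set(programs)                    # all patch-programs start active
    while active:
        pid = active.pop()
        prog = programs[pid]
        if not getattr(prog, "initialized", False):
            prog.init()                       # first run: set up local context
            prog.initialized = True
        prog.input(inbox.pop(pid, []))        # consume received streams
        prog.compute()                        # user-defined numerical kernel
        for target, stream in prog.output():  # send streams, activate targets
            inbox.setdefault(target, []).append(stream)
            active.add(target)
        if not prog.vote_to_halt():           # ready work remains: stay active
            active.add(pid)
```

A program that votes to halt is simply not re-added; receiving a stream later re-activates it, matching the finite state machine in Fig. 7.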
III-A1 Partial computation of the patch-program
Partial computation is an essential property of the patch-program. On one hand, a patch-program generally can't finish in one pass and thus requires many rounds of scheduling. As illustrated by the Sn sweeps case (Sec. II-D), two patch-programs may depend on each other's data, leading to deadlock if patch-programs are not reentrant. On the other hand, a patch-program may contain multiple parts of computation that depend on data from different patch-programs, so allowing a patch-program to execute multiple times exposes finer-grained parallelism.
In our abstraction, partial computation of a patch-program is achieved as follows. First, we allow storing a local context so the state is memorized, as illustrated by the implementation of sweeps in section V-A. Second, the finite state machine in Fig. 7 maintains the state transitions of a patch-program, ensuring correct termination after an arbitrary number of partial executions.
III-A2 Simultaneous tasks on a patch
Our abstraction supports multiple tasks on the same patch. Since any patch-program is identified by a (patch, task) pair, multiple tasks on a patch naturally execute in parallel, even with possible inter-task communication. Whether and how to decompose the work on a patch into patch-programs is the programmer's decision. For the full Sn sweeps discussed in Sec. II-D2, by defining the sweep on a patch from each angle as a patch-program (patch, angle), sweeps from all directions execute simultaneously.
III-B Scheduling patch-programs
The data-driven engine initializes active patch-programs and continues to schedule them to run until program termination.
For general patch-centric data-driven computations, the necessary and sufficient condition of program termination is that all patch-programs globally become inactive. To detect this condition in distributed settings, general negotiating protocols [14] are needed. However, in numerical algorithms requiring the data-driven approach, the workload is known in advance, so we can often detect termination with little or even no distributed negotiation. For example, in sweeps, the program termination condition is that all vertices are computed, which is known by every patch before computation, and termination detection needs only local information. In JSweep's real implementation, we allow the patch-program to commit its remaining workload (i.e., the number of uncomputed vertices in sweeps) to a data structure shared by the master and worker threads of the local runtime system (detailed in the next section). The master thread, as the representative of the process, participates in distributed termination negotiation only when there are no longer patch-programs with remaining workload.
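The shared remaining-workload structure can be sketched as below (class and method names are hypothetical): workers commit per-program counts, and the master checks the local total before joining the distributed negotiation.

```python
import threading

# Sketch of a shared remaining-workload ledger (hypothetical names).
# Worker threads commit how many vertices their patch-programs still
# have to compute; the master enters the distributed termination
# negotiation only when the local total reaches zero.

class WorkloadLedger:
    def __init__(self):
        self._lock = threading.Lock()
        self._remaining = {}          # patch-program id -> #uncomputed vertices

    def commit(self, pid, remaining):
        """Record the remaining workload of one patch-program."""
        with self._lock:
            self._remaining[pid] = remaining

    def local_done(self):
        """True when no local patch-program has remaining work."""
        with self._lock:
            return all(r == 0 for r in self._remaining.values())
```

Because the total workload of a sweep is known in advance, this local check avoids the general negotiating protocol in the common case.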
Priority policies are known to be critical for scheduling computations, yet they are tightly coupled with the properties of the problem itself. In section V-D, we discuss several strategies used in parallel sweeps.
IV Patch-centric Data-driven Runtime System
In this section, we present the runtime that maps the patch-centric data-driven computation and communication onto the underlying resources. Our target platform is the multi-core cluster widely adopted in contemporary HPC systems. Fig. 8 shows an overview of the runtime system. In particular, our design emphasizes fast stream delivery, load balance, fine-grained parallelism and low scheduling overhead.
IV-A Hybrid parallelism
The runtime inherits from JAxMIN a hybrid MPI+threads parallel approach, in which the program is organized as distributed-memory MPI processes and each process consists of multiple threads. On top of this design, JAxMIN has been highly optimized in both domain decomposition and data management. Besides, to reduce NUMA effects in multi-socket systems, by default our runtime launches one MPI process per processor and binds the process to the processor. Within each MPI process, the master-workers multithreading mode is adopted. As shown in Fig. 8, the master thread is in charge of scheduling patch-programs, communicating streams and detecting global termination, while each worker thread executes patch-programs and communicates only with the master thread.
IV-B Dynamic stream communication
In the patch-centric abstraction (section III-A), communication conceptually happens between a pair of patch-programs and is defined as a stream. For data-driven algorithms, the latency of stream transmission is critical for performance, since a patch-program can execute only after it has received the data it depends on. By definition, the communication of streams is asynchronous and dynamic.
The runtime system adopts the routable stream concept and reserves a dedicated core for the master thread to support such timely communication. As defined in section III-A, the stream itself carries full information about its source and target patch-programs. By identifying the target patch-program and looking up the route table that maps patches to processes, the runtime system can deliver any stream to its target place, either locally or in a remote process.
The master thread schedules patch-programs by routing streams. At the beginning, all patch-programs are active and assigned to workers evenly. Later, as the execution progresses, some patch-programs become inactive. If the master thread receives a stream whose target patch-program is inactive, it chooses the most lightly loaded worker, sets it as the patch-program's owner, and then routes the stream to it.
IV-C Distributed progress tracking
The master thread calls the progress tracker to detect global program termination. Once there is temporarily no work left within the process, the progress tracker is activated. A consensus algorithm is implemented to detect distributed termination [14]. Besides the general negotiating protocol, as discussed in section III-B, for known data-driven algorithms specialized condition-detection methods are preferred for efficiency in practice. Currently, we support both.
V New Parallel Algorithm of Sn Sweeps
Now we describe a new parallel sweeps algorithm based on JSweep, the above patch-centric data-driven approach. Further, we explore four optimizations, namely scheduling by (patch, angle), vertex clustering, a multi-level priority strategy, and a coarsened graph, which are natural and efficient to implement thanks to the expressiveness of the patch-centric abstraction.
V-A Patch-Program Implementation
We assume that the mesh has been decomposed into patches with general spatial domain decomposition methods (for example, METIS [18] and Chaco [19] for unstructured meshes, or Morton and Hilbert space-filling curves for structured meshes). Each process is assigned an arbitrary number of patches, shared by all its threads.
Formally, we define the directed graph induced by sweeping meshes as G = (V, E), where each vertex in V is a (cell, angle) pair and each edge in E is a directed data dependency between two vertices. An edge (u, v) means vertex v depends on vertex u's data. For any patch p and sweeping direction a, we denote the induced subgraph as G(p, a) = (V(p, a), E(p, a)), where V(p, a) is the set of vertices whose cells belong to patch p with angle a, and E(p, a) is the set of edges among them.
Listing 1 presents the patch-centric implementation of parallel sweeps. As presented, the patch-program consists of two parts, i.e., the local context and the interface implementation.
The local context contains all the state required by a reentrant sweep on the patch, including: (line 6) an array of counters that track the number of unfinished upwind neighbors for each local vertex, (line 7) a priority queue storing ready vertices, and (line 8) streams later sent to other patch-programs.
The interface functions implement the DAG-based data-driven sweep on the patch in a given direction. The init function initializes each vertex's count variable to the number of its upwind neighbors and collects source vertices into the ready queue. The input function receives data of vertices from remote patches and updates the counts of related local vertices; once a local vertex's count decreases to zero, it is put into the ready queue. The compute function collects a sequence of ready vertices, computes on them with the user-defined numerical kernel, and updates their downwind neighboring vertices. The output function generates streams sent to remote patch-programs. The vote_to_halt function evaluates whether the patch-program should deactivate.
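The counter and ready-queue mechanics above can be sketched as follows; all names are illustrative stand-ins for the Listing 1 code, and `grain` anticipates the vertex-clustering optimization of Sec. V-C:

```python
import heapq

# Schematic per-patch sweep state (illustrative names, not the real
# Listing 1 interface): `count` tracks unfinished upwind neighbors,
# `ready` orders ready vertices by a user-supplied priority.

class PatchSweep:
    def __init__(self, upwind_count, downwind, is_local, priority, kernel):
        self.count = dict(upwind_count)   # local vertex -> #unfinished upwind
        self.downwind = downwind          # vertex -> downwind neighbors
        self.is_local = is_local          # vertex -> owned by this patch?
        self.priority = priority
        self.kernel = kernel              # user-defined numerical kernel
        # init: source vertices (no upwind neighbors) start ready
        self.ready = [(priority(v), v) for v, c in self.count.items() if c == 0]
        heapq.heapify(self.ready)

    def input(self, finished_remote):
        """Receive finished remote vertices; release local dependents."""
        for u in finished_remote:
            self._release(self.downwind.get(u, ()))

    def compute(self, grain):
        """Compute up to `grain` ready vertices (vertex clustering)."""
        done = []
        while self.ready and len(done) < grain:
            _, v = heapq.heappop(self.ready)
            self.kernel(v)
            done.append(v)
            self._release(self.downwind.get(v, ()))
        return done

    def _release(self, neighbors):
        for v in neighbors:
            if self.is_local(v):
                self.count[v] -= 1
                if self.count[v] == 0:
                    heapq.heappush(self.ready, (self.priority(v), v))
```

Finished vertices returned by `compute` would then be packed by an output step into streams for the patches owning their remote downwind neighbors.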
V-B Optimization: Patch-Angle Parallelism
JSweep naturally supports simultaneous sweeps on a patch from different angles (i.e., sweeping directions). As shown in Listing 1, we achieve this by setting the task tag of the patch-program to the id of the sweeping direction. Consider the example in Fig. 5 again, in which full transport sweeps are carried out on a 2D unstructured mesh of 2 patches. In this example, sweeps from different angular directions are independent, and thus patch-angle parallelism can be fully enabled. This is especially useful for small meshes with a large number of angles.
V-C Optimization: Vertex Clustering
We adopt vertex clustering in the patch-program. As shown in Listing 1, the compute function collects and computes on multiple ready vertices, rather than a single vertex. For example, in Fig. 4 vertices in the same dashed rectangle are clustered together.
The benefits of this optimization are twofold. On one hand, vertex clustering can dramatically reduce the scheduling overhead by reducing the number of executions of a patch-program. For example, in Fig. 4, patch 1 and patch 2 need only two and three executions respectively, compared with eight executions each without clustering. On the other hand, vertex clustering aggregates multiple streams to the same target into a single stream and thus reduces communication overhead. For example, in Fig. 4, the inter-vertex communications crossing the patch boundary can be combined into one message.
Nevertheless, we need to choose the clustering grain carefully. While reducing the overhead of scheduling and communication, vertex clustering also has a potentially negative effect on parallelism, since it may defer communication and thus delay the scheduling of other patch-programs. Excessive clustering can lead to long communication delays and thus longer execution time. To illustrate this, consider SnSweep, an example in the JAxMIN package, which implements a solver for neutron transport equations on 3D structured meshes. The experimental results (mesh cells: , patch size: , Sn ordinates, 8*12 CPU cores) are shown in Fig. (a)a.
V-D Optimization: Priority Strategy
We adopt a two-level hierarchical priority strategy, i.e., patch-program priority and vertex priority.
The patch-program priority is used by the JSweep runtime to schedule patch-programs. For patch p and angle a, the priority is calculated as

priority(p, a) = C * prio(a) + prio(p),

where C is a large constant factor. In sweep, to avoid keeping downwind patches waiting, we want patch-programs with the same angle to be scheduled consecutively, so that data streams are delivered to the nearby patches as quickly as possible. Thus we make the angle term always outweigh the patch term in the formula, by multiplying the large factor C with the angle priority. With respect to the patch priority, however, we can't reduce it to a single objective. On one hand, we want the upwind patches computed as early as possible, so that more parallelism becomes available. On the other hand, we also want the patches neighboring other unfinished patches computed earlier, but these preferred patches are typically on the downwind side of a sweeping direction. Based on the first objective, we develop two priority strategies: LDCP (Longest Distance on Critical Path) for structured meshes and BFS (Breadth First Search) for unstructured meshes. Based on the second objective, we develop the priority strategy SLBD (Shortest Local Boundary Distance, a DFS variant that prefers vertices closest to the patch boundary) for both structured and unstructured meshes.
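This angle-dominant scheme can be sketched as a composite key; the constant `C` and the per-angle/per-patch priority helpers are assumptions for illustration:

```python
# Sketch of the angle-dominant two-level priority (constant C and the
# helper priorities are assumptions, not JSweep's exact values).

C = 1_000_000  # large factor: the angle term always dominates the patch term

def patch_program_priority(angle_prio, patch_prio):
    """Angle-major key: patch-programs sweeping the same angle are
    scheduled consecutively, so streams reach downwind patches quickly."""
    return C * angle_prio + patch_prio
```

Any valid patch priority stays below `C`, so comparing keys compares angles first and patches only as a tiebreaker.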
The vertex priority is used within a patch-program to order local ready vertices in the PriorityQueue in Listing 1. Like the patch-program priority, vertex priority also has to trade off more parallelism against earlier communication. The strategies proposed for the patch-program priority, i.e., BFS, LDCP (for structured meshes only) and SLBD, are also suitable for vertex priority. In practice, however, we observed that SLBD performs consistently best, for structured and especially unstructured meshes, as illustrated by the SnSweep experiments in Fig. (b)b.
V-E Extra Optimization: Coarsened Graph
The coarsened graph, not presented in Listing 1 due to length limits, can be treated as an extension of vertex clustering. In reality, the mesh structure and its data dependencies remain constant across most or even all sweeping iterations. Thus we can cache the vertex clustering results to build a reusable coarsened graph. For example, in Fig. 10, the directed graph (left) is transformed into a much smaller coarsened graph (right) according to the previous clustering results in Fig. 4.
Formally, we define the coarsened graph as a property graph Gc = (Vc, Ec), where Vc is the set of coarsened vertices derived from the vertex clusters and Ec is the set of coarsened edges. Gc captures the topology of the vertex clusters and the directed communication relationships among them. The property of a coarsened vertex in Vc is the series of corresponding clustered DAG vertices, while the property of a coarsened edge in Ec is the combined set of DAG edges between its source and target clusters. Fig. 10 gives an example. Since the coarsened graph is the task graph generated in the scheduling process, we have the computability theorem:
Theorem 1
If a directed graph is acyclic, its derived coarsened graph is also acyclic.
Besides, the coarsened graph is distributed at the time of its construction. We implement it with the same techniques presented in agentgraph [40]. With the coarsened graph, sweep is carried out on the DAG in the first iteration and on the coarsened graph in all subsequent iterations until the mesh changes. In our practice with JSNTS [25], the cost of building the coarsened graph is less than that of one DAG-based sweep iteration, while the speedup of sweeps on the coarsened graph over the DAG can be severalfold.
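The cluster-to-coarsened-graph construction can be sketched as below; the helper name is hypothetical, and `cluster_of` is assumed to come from the cached vertex-clustering results of a previous sweep:

```python
# Sketch of building the coarsened graph from cached vertex clusters:
# each cluster becomes one coarsened vertex, and parallel DAG edges
# between two clusters are combined into a single coarsened edge that
# remembers the underlying (source, target) vertex pairs.

def coarsen(edges, cluster_of):
    """edges: iterable of (u, v) DAG edges; cluster_of: vertex -> cluster id."""
    coarse_edges = {}   # (src_cluster, dst_cluster) -> combined DAG edges
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:    # intra-cluster edges disappear after contraction
            coarse_edges.setdefault((cu, cv), []).append((u, v))
    return coarse_edges
```

Because the clusters are the execution batches of a valid sweep schedule, contracting them cannot introduce a cycle, which is the content of Theorem 1.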
VI Evaluation
Platform. All experiments were carried out on TianheII, the world's fastest supercomputer in 2015 [3]. We use at most 3,200 nodes. Each node has two Intel Xeon E5-2692 v2 12-core processors, 64GB memory and the TianheExpress-II network with 40GB/s bandwidth. The operating system is Kylin Linux. All applications are compiled with Intel C compilers (icc13) and a customized MPICH2.
Applications. We use two real JAxMIN-based Sn applications, JSNTS [25] and JSNTU [26], to investigate the efficiency of JSweep on structured and unstructured meshes respectively. The meshes used are visualized in Fig. 11.
VI-A Evaluation on structured meshes
JSNTS [25] is a JASMIN-based Sn package for structured meshes, which implements most functionalities of TORT [24]. We use the well-known Kobayashi benchmark to evaluate JSweep. In particular, we focus on strong scalability. In all the following experiments, JSweep is configured as follows: patch size = , vertex clustering grain = 1000, and priority strategy SLBD+SLBD.
We first evaluate JSweep with the original Kobayashi benchmark (Kobayashi400). It solves the single-energy-group Sn transport equations with scattering, on a cubic mesh (Fig. (a)a) of cells with 320 angular directions. As presented in Fig. (a)a, with increasing numbers of cores, JSweep consistently shows reasonable scalability, with a speedup of 14.3 (parallel efficiency 44.7%) on 24,576 cores compared to 768 cores.
We then evaluate JSweep on more CPU cores with a larger problem, obtained by proportionally enlarging the mesh of the Kobayashi input, namely Kobayashi800. As shown in Fig. (b)b, JSweep scales to 76,800 cores with a reasonable speedup of 7.4 (parallel efficiency 46.3%), normalized to the performance on 4,800 cores.
VI-B Evaluation on unstructured meshes
JSNTU [26] is a JAUMIN-based Sn package for unstructured meshes, primarily used for numerical simulations in high energy physics. We evaluate JSweep on two shapes of unstructured meshes, reactor core (Fig. (b)b) and ball (Fig. (c)c). Unless otherwise stated, the default configuration is as follows: priority strategy SLBD+SLBD, patch size = 500 cells, vertex clustering grain = 64, #angles = 24 and #energy groups = 4.
VI-B1 Hyperparameters' effect on performance
We vary and investigate three hyperparameters in order, i.e., patch size, vertex clustering grain and priority strategy, while keeping the others at their defaults. As shown in Fig. (a)a (left), with increasing patch size (i.e., #cells of a patch), the runtime first decreases quickly, since a larger patch size reduces the total communication between patches, and then slightly increases, since a larger patch size also leads to longer waiting times for downwind patches whose execution is driven by data from this patch. Fig. (a)a (right) shows the effect of the maximum vertex clustering grain. With increasing clustering grain, the runtime decreases quickly and then stays steady. Unlike on structured meshes (Fig. (a)a), however, the runtime no longer increases with a large clustering grain. By profiling, we found that the actual number of available vertices is between 16 and 64 most of the time, which means the real clustering grain is limited by parallelism. With respect to priority strategies, as shown in Fig. (b)b, their effect on performance is not as significant as on structured meshes (Fig. (b)b).
VI-B2 Strong Scalability
As shown in Fig. 14, JSweep shows good strong scalability on both small and large meshes. For the small-scale problem (ball of 482,248 cells), JSweep shows a speedup of 11.5 (parallel efficiency 72%) at 384 cores and reaches a speedup of 75.8 (parallel efficiency 30%) at 6,144 cores, normalized to the 24-core base. For the large-scale problem (ball of 173,197,768 cells), JSweep shows a speedup of 9.9 (parallel efficiency 62%) at 49,152 cores, normalized to the 3,072-core base. Given that the tested mesh is a ball constructed with tetrahedrons, this scalability should be considered reasonably good.
VI-B3 Weak Scalability
Fig. 15 presents the results of weak scalability evaluated on the ball (originally 482,248 cells) and reactor core (originally 64,479 cells) meshes. In particular, the mesh size is increased by an approximately uniform refinement method. As shown, the weak scalability of JSweep is reasonable but not good enough. For the reactor core, the parallel efficiency at 12,288 cores is about 40%, while for the ball it is lower than 20%. One possible reason is that in JAxMIN, the original small mesh is first partitioned and distributed to processes, and then each process refines its assigned subdomain, leading to thick subdomains that dramatically increase the length of the critical path in the sweeping direction.
VI-C Runtime Overhead Analysis
JSweep employs a runtime-based approach, so its overhead is critical to performance. To investigate this overhead, we carried out a detailed profiling of JSNTS on a small-scale Kobayashi benchmark. All optimizations are enabled and all hyperparameters are the same as in Sec. VI-A. We profile one sweep iteration using the coarsened graph.
Fig. 16 shows the time breakdown in a strong-scaling fashion. The overhead introduced by JSweep (i.e., the graph operations and pack/unpack) is moderately low (approx. 2%–3%), and the major performance loss comes from idling of CPU cores (22%–46%). Communication takes 13%–19% of the total time. With deeper optimization and more advanced priority strategies, we expect to reduce both the overhead and the idle time.
VI-D Performance Comparison with Other Systems
We first compare JSweep’s performance with the previous JAxMIN implementations, which already employ efficient algorithmic optimizations [32] and achieve good performance. We show that, with our innovations in data-driven abstraction and runtime system design, JSweep outperforms them on both structured and unstructured meshes.
Fig. (a) presents the results of JSweep and the JASMIN-based SnSweep program (a data-driven implementation of Sweep3D [12]). We choose SnSweep because it has been manually optimized with all techniques introduced in Sec. V, including a coarsened-graph variant which caches the vertex clusters and their communication relationships via MPI tags. As shown, JSweep’s runtime is consistently lower than JASMIN’s.
Fig. (b) compares the runtimes of JSweep and the JAUMIN-based JSNTU. Again, JSweep shows a consistent runtime reduction, and the advantage grows slightly with increasing core counts.
In addition, Table I compares the parallel efficiency of JSweep with other work in the literature. For the Kobayashi problem, JSweep demonstrates scalability comparable to Denovo’s KBA-based implementation. For the unstructured ball (sphere) mesh of tetrahedra, JSweep scales worse than the manually implemented data-driven algorithm PSDb [27]. Note, however, that JSweep is a general framework solution. Owing to the lack of common public benchmark problems and the limited availability of these systems, it is difficult to compare frameworks directly.
VII Related Work
The idea of patch-centric abstraction is partly inspired by the vertex-centric models [38] in graph-parallel frameworks [38][39]. In a vertex-centric model, the user defines a vertex program for a single vertex and the framework lifts the vertex computation to the whole graph, conceptually in parallel. However, unlike a vertex in a graph, a patch is not the basic element of a mesh, which means a patch-centric data-driven abstraction faces fundamental difficulties, including partial computation, priority inversion and multiple tasks on a single patch. In this paper we comprehensively address these issues and formalize a general patch-centric data-driven approach. In fact, the patch-centric abstraction can be seen as a straightforward extension of existing patch-based frameworks (see a survey in [23]), such as SAMRAI [28], (part-based) PUMI [29] and especially JAxMIN [1][2], in which the mesh is decomposed into and managed by patches.
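To make the contrast with vertex programs concrete, here is a minimal sketch of a re-entrant patch program supporting partial computation (all names are hypothetical, not JSweep's actual API): each invocation consumes only the upwind data that has actually arrived and leaves the rest pending, so the runtime may re-enter the same patch several times before it completes.

```python
class Patch:
    """Sketch of a re-entrant, patch-centric program. Unlike a vertex
    program, which runs once when all inputs are ready, a patch may
    run partially on whatever upwind angle data has arrived."""

    def __init__(self, pending_angles):
        self.pending = set(pending_angles)  # angles waiting for upwind data
        self.done = set()

    def on_data(self, arrived_angles):
        # Partial computation: sweep only the angles now enabled;
        # the runtime re-enters this patch when more data arrives.
        runnable = self.pending & set(arrived_angles)
        for angle in sorted(runnable):
            self.sweep_angle(angle)
        self.pending -= runnable
        self.done |= runnable
        return not self.pending  # True once the patch is fully swept

    def sweep_angle(self, angle):
        pass  # placeholder for the per-angle sweep kernel

p = Patch(pending_angles=[0, 1, 2])
print(p.on_data([0, 2]))  # False: angle 1 still waits on upwind data
print(p.on_data([1]))     # True: patch fully swept after re-entry
```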
Task-based programming models, such as PaRSEC [36] and the more general Charm++ [37], are also suitable for implementing location-based data-driven computations. For example, a recent work [22] implemented a PaRSEC-based sweep solver on 3D Cartesian meshes, demonstrating high efficiency with 34% of peak performance at 384 cores. Compared to JSweep, however, these task-based frameworks are not well abstracted for mesh-specific parallel computation and thus require users to remap many concepts.
VIII Conclusions and Future Work
We have presented JSweep, a generic patch-centric data-driven framework integrated into the JAxMIN infrastructure. In particular, we propose the patch-centric data-driven abstraction, whose essential idea is extending the patch concept into a logical processing element that is fully reentrant. Our abstraction also supports multiple tasks on a single patch and arbitrary patch priority strategies. Further, targeting contemporary HPC systems with multicore cluster architectures, we implemented a high-performance runtime system to map the patch-centric data-driven computation onto the underlying system resources. Based on the above approach, we implemented a new parallel sweep algorithm as a component in JAxMIN, featuring patch-angle parallelism, vertex clustering and a hierarchical priority strategy. Evaluation with two real Sn software packages demonstrates that JSweep scales to at least 49,152 cores for unstructured meshes and 76,800 cores for structured meshes with reasonable parallel efficiency.
What distinguishes JSweep from most counterparts is that we consider sweep computations in the context of general mesh-based application frameworks, which will be critical to future coupling of multiple multiphysics simulations. Besides transport sweeps, our abstraction also supports other data-driven algorithms well, e.g., particle tracing, which we have implemented as another component in JAxMIN. Given the increasing importance of data-driven computation and the demands of coupling multiphysics simulations, we believe efficient support for both BSP and asynchronous data-driven models is necessary to construct high-performance applications.
References
 [1] Z. Mo, A. Zhang, X. Cao, Q. Liu, X. Xu, H. An, W. Pei, S. Zhu, JASMIN: a parallel software infrastructure for scientific computing, Frontiers of Computer Science in China, 2010, 4(4):480–488.
 [2] Q. Liu, W. Zhao, J. Cheng, Z. Mo, A. Zhang and J. Liu, A programming framework for large scale numerical simulations based on unstructured meshes, in Proc. HPSC, New York, Apr. 6–8, 2016.
 [3] http://top500.org/2015nov, Nov., 2015.
 [4] R. L. Bowers, J. R. Wilson, Numerical modeling in applied physics and astrophysics, Jones and Bartlett publishers, 1991.
 [5] J. Bey, G. Wittum, On the robust and efficient solution of convection-diffusion problems on unstructured grids in two and three space dimensions, Applied Numerical Mathematics, 1997, 23(1):177–192.
 [6] F. Wang, J. Xu, A crosswind strip block iterative method for convection-dominated problems, SIAM Journal of Computing, 1999, 21:646–665.
 [7] T. Downar, A. Siegel, C. Unal. Science Based Nuclear Energy Systems Enabled by Advanced Modeling and Simulation at the Extreme Scale. White Paper on Integrated Performance and Safety Codes, 2009.
 [8] R. Baker, R. Alcouffe. Parallel 3D Sn Performance for MPI on Cray T3D. In Proc. Joint International Conference on Mathematics Methods and Supercomputing for Nuclear Applications, New York, Oct., 1997.
 [9] R. Baker, K. Koch. An Sn algorithm for the massively parallel CM-200 computer. Nuclear Science and Engineering, 1998, 28: 312–320.
 [10] Lawrence Livermore National Laboratory, Ardra: Scalable parallel code system to perform neutron and radiation transport calculations, http://www.llnl.gov/casc/ardra.
 [11] W. Hawkins, et al., Efficient Massively Parallel Transport Sweeps, Trans. Am. Nucl. Soc., 107, 477, 2012.
 [12] Los Alamos National Laboratory. The ASCI Sweep3d Benchmark. http://www.ccs3.lanl.gov/pal/software/sweep3d.
 [13] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 1990, 33(8):103–111.
 [14] J. Misra. Detecting termination of distributed computations using markers. In Proc. PODC, pages 290–294, 1983.
 [15] S. Plimpton, B. Hendrickson, S. Burns, W. McLendon. Parallel algorithms for radiation transport on unstructured grids. In Proc. SC, Dallas, Nov., 2000.
 [16] Z. Mo, A. Zhang and X. Cao. Towards a parallel framework of grid-based numerical algorithms on DAGs. In Proc. IPDPS, Greece, 2006.

 [17] C. Hewitt, P. Bishop, R. Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. In Proc. IJCAI, San Francisco, 1973.
 [18] G. Karypis, V. Kumar. Multilevel graph partitioning schemes. In Proc. of ICPP, Urbana-Champaign, Aug., 1995, pp. 113–122.
 [19] B. Hendrickson, R. Leland. A multilevel algorithm for partitioning graph. In Proc. of SC, San Diego, 1995.
 [20] S. Pautz. An Algorithm for Parallel Sn Sweeps on Unstructured Meshes. Nuclear Science and Engineering, 2002, 140(2): 111–136.
 [21] M. Mathis, D. Kerbyson. A General Performance Model of Structured and Unstructured Mesh Particle Transport Computations. Journal of Supercomputing, 2005, 34:181–199.
 [22] S. Moustafa, M. Faverge, L. Plagne and P. Ramet. 3D Cartesian Transport Sweep for Massively Parallel Architectures with PaRSEC, In Proc. IPDPS, 2015.
 [23] A. Dubey, et al., A survey of high level frameworks in blockstructured adaptive mesh refinement packages. J. Parallel Distrib. Comput. (2014), http://dx.doi.org/10.1016/j.jpdc.2014.07.001.
 [24] W. A. Rhoades and D. Simpson, The TORT Three-Dimensional Discrete Ordinates Neutron/Photon Transport Code, ORNL/TM-13221, Oct., 1997.
 [25] T.P. Cheng and L. Deng, JSNTS manual, IAPCM, 2015.
 [26] J.X. Wei, JSNTU (3DSn) manual, IAPCM, 2010.
 [27] G. Colomer, R. Borrell, F. Trias, and I. Rodríguez, Parallel algorithms for Sn transport sweeps on unstructured meshes. Journal of Computational Physics, vol. 232, no. 1, pp. 118–135, 2013.
 [28] SAMRAI. https://computation.llnl.gov/casc/SAMRAI, May 31, 2010
 [29] D. A. Ibanez, E. S. Seol, C. W. Smith and Mark S. Shephard, PUMI: Parallel Unstructured Mesh Infrastructure. ACM Transactions on Mathematical Software, 2015.

 [30] G. Davidson, T. Evans, J. Jarrell, S. Hamilton and T. Pandya, Massively Parallel Three-dimensional Transport Solutions for the k-eigenvalue Problem, Nuclear Science and Engineering, 2014, 177:111–125.
 [31] T. Evans, A. Stafford, R. Slaybaugh and K. Clarno, Denovo: a new three-dimensional parallel discrete ordinates code in SCALE. Nuclear Technology, 2010, 171(8), 171–200.
 [32] Z. Mo, A. Zhang, and Z. Yang, A new parallel algorithm for vertex priorities of data flow acyclic digraphs, The Journal of Supercomputing, vol. 68, no. 1, pp. 49–64, 2014.
 [33] S. Pautz and T. Bailey, Parallel Deterministic Transport Sweeps of Structured and Unstructured Meshes with Overloaded Mesh Decompositions, Proc. Joint International Conference on Mathematics and Computation (M&C), Supercomputing for Nuclear Applications (SNA) and the Monte Carlo (MC) Methods, Nashville, TN, April 19–23, 2015.
 [34] P. Song, C.L. Zhai, S.G. Lee, et al., LAREDI: The Integrated Code for Laser-driven Inertial Confinement Fusion. High Power Laser and Particle Beams (in Chinese), 2015, 27(3):54–60.
 [35] D. R. Gaston, C. J. Permann, J. W. Peterson, A. E. Slaughter, D. Andrš, Y. Wang, M. P. Short, D. M. Perez, M. R. Tonks, J. Ortensi, L. Zou and R. C. Martineau, Physics-based multiscale coupling for full core nuclear reactor simulation, Annals of Nuclear Energy, 2015, 84:45–54.
 [36] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier and J. Dongarra, DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, no.12, 2012.
 [37] L.V. Kale, E. Bohm, C. L. Mendes, T. Wilmarth and G. Zheng, Programming petascale applications with Charm++ and AMPI, Petascale Computation: Algorithms Appl. 1(2007): 421–441.
 [38] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser and G. Czajkowski. Pregel: a system for largescale graph processing. In SIGMOD, 2010.
 [39] J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin. PowerGraph: distributed graphparallel computation on natural graphs. In OSDI, 2012.
 [40] J. Yan, G. Tan, Z. Mo, N. Sun, Graphine: programming graph-parallel computation of large natural graphs for multicore clusters, IEEE Transactions on Parallel and Distributed Systems, 27(6):1647–1659, 2016.