Composite Numerical Methods and Scalable Tile Algorithms

Numerical modeling is an important tool when studying various natu-ral processes and phenomena. Fractional diﬀusion can be used for modeling many processes in biology, for example in silico experiments in molecular biology and medicine design, protein diﬀusion within cells, complex media geometry, etc. The problem is usually reduced to a system of linear algebraic equations and in many cases this system has a dense co-eﬃcient matrix. Numerically solving such problems with the traditional LU factorization is a computationally expensive endeavour – O ( n 3 ). In this paper we explore the use of a hierarchical compression method based on Hierarchical Semi-Separable compression and ULV-like factorization from the STRUctured Matrices PACKage (STRUMPACK) software library for a ﬂow around airfoils problem discretized with Boundary Element Method and fractional diﬀusion problem discretized with the Finite Element Method. The HSS based method promises better overall computational complexity of O ( r 2 n ) for problems with suitable structure – low rank oﬀ-diagonal blocks. Here r is the maximum rank of the oﬀ-diagonal blocks. We present analysis of the performance and accuracy of the HSS based method and compare it with the state of the art direct LU factorization solvers. This paper is based on the PhD thesis “Composite Numerical Meth-ods and Scalable Tile Algorithms” of the author defended on 17.05.2022 in the Institute of Information and Communication Technologies at the


Introduction
Numerical solutions of large-scale problems require the use of high-performance computer systems, as well as specialized hardware and software -graphics cards, accelerators, high-speed communication between the system's servers, software standards and packages for communication between processor cores, GPUs/accelerators and nodes, software packages implementing effective numerical methods and much more.
There are a number of methods for discretization of differential equations; for example, the mesh methods such as finite element method, boundary elements method and finite differences method. After applying such methods the problem is reduced to solving large systems of linear equations. The Gaussian eliminated is the universal method for solving such problems. In the general case it has high computational complexity -O(n 3 ), where n is the number of unknowns [1].
When the discretization of the differential equation is carried out with the boundary element method or when the finite element method is applied to non-local problems (such as the examined in this paper anomalous (fractional) diffusion) [2], the arising matrix is dense. One way of lowering the computational complexity of solving problems with such matrices is the application of hierarchical compression introduced by Hackbusch [3]. With this method the structure of the matrix is utilized. The aim is to reduce the required memory and to improve the computational complexity. Here the term dense matrix structure is understood as the presence of low-rank off-diagonal blocks. This property allows the representation of the off-diagonal blocks as the product of smaller matrices. There are several types of hierarchical matrices: H [3], H 2 [4], Hierarchically Semi-Separable (HSS), etc.
This paper is based on the abstract [5] of the PhD thesis [6].

Overview of key results in the field
The huge progress in the capabilities of modern high-performance computing systems further underlines the role of efficient numerical methods and parallel algorithms. Supercomputer simulations are crucial for development in a number of advanced areas. Examples are in silico molecular biology and medicine design [7,8], analysis of turbulent flows [9], non-destructive testing [10], 3D image processing [11], fluid dynamics [12] and many others.
After suitable discretization, the mathematical models are usually reduced to problems of linear algebra, among which central role has the solution of systems of linear algebraic equations. For this purpose specialized software tools are developed.
In the general case variants of the Gaussian method are employed in order to solve systems of linear algebraic equations with dense matrices. Such methods utilize the sequential elimination of the unknowns. In general, the dense matrix is thought to be homogeneous, as it is not assumed to contain zeros. The Gaussian elimination method has computational complexity O(n 3 ). In this work an alternative approach based on hierarchical compression is investigated. The goal is to reduce the computational complexity of the solution process. Here by structure of dense matrices we understand the existence of low-rank off-diagonal blocks. More to the point, such property is found in a matrix approximating the original. The existence of such suitable structure is the basis of the hierarchical compression and the arising methods for solving systems of linear algebraic equations for classes of computational mathematics problems. Hierarchical compression is introduced by Hackbusch in [3], where the so-called H-matrices are studied. Other types of hierarchical compression are H 2 -matrices [4] and hierarchical semi-separable matrices (HSS) [13]. A theoretical basis for HSS compression methods can be found in [14].
A large part of this paper is dedicated to the numerical solution of fractional diffusion problems. Fractional (also known as anomalous) diffusion describes non-local processes observed in different physical and social media. Unlike ordinary (local) diffusion, anomalous diffusion includes the so-called fast transitions or tunnel effects. Various examples of mathematical models have been published in the literature for processes and phenomena described by fractional diffusion. Some examples are: flows in strongly non-homogeneous porous media, superconductivity, diffusion of polymers in supercold media [15]; electrodiffusion of ions into nerve cells [16] and photon diffusion diagnostics [17]; image processing and machine learning [18]; spread of viral diseases, computer viruses and crime [19]. The fractional Laplace operator describes anomalous diffusion in space. There are different definitions of the fractional Laplacian. It is important to note that they are not equivalent. For example, in [20] the difference between integral and spectral definitions is studied (see also articles [21] and [22] and the literature in them).

Goals and objectives of this work
The major goals of this paper are: • Comparative analysis of the performance and parallel speedup of frequently used software packages applying direct Gaussian elimination for solving systems of linear algebraic equations with dense matrix on CPUs and accelerators (MICs).
• Analysis of the performance, parallel speedup and accuracy of an approximate method for solving systems of linear algebraic equations based on hierarchical semi-separable compression from the STRUMPACK software package for systems with suitable structure.
• Development of reordering algorithms for the unknowns for systems of linear algebraic equations arising from discretization with finite element method of fractional diffusion. The reordering is aimed at improving the effectiveness of hierarchical semi-separable compression when applied on the stiffness matrix.
• Numerical solution of elliptic and parabolic problems in the field of fractional diffusion, modeled with the integral formulation of the fractional Laplacian and discretized with finite elements.

Research methodology
In this paper we analyze the effectiveness, in terms of performance, parallel speedup and accuracy (for approximate solutions), of tile methods for solving dense systems of linear algebraic equations. For this purpose, software packages are used, in which the studied methods are applied.
For the problem, examined in Section 3 we utilize the parallel program developed in [23] for the discretization and generation of the system of linear algebraic equations. For the fractional diffusion problems, examined in Sections 4 and 5, the MatLab program developed in [2] is used for the generation of the system of linear algebraic equations. We developed MatLab programs for the calculation of the reordering schemes [24][25][26] and the lumped matrix of mass [27].

Content structure
Section 1 provides the motivation for the current work and a short description of the methods and problems. Section 2 has an introductory character and describes the utilized tile methods for solving dense systems of linear algebraic equations, as well as an estimation of their computational complexity. In Subsection 2.1 we provide a short description of the universal direct method Gaussian elimination and the LU factorization based on it. Subsection 2.4 examines the hierarchical methods for solving systems of linear algebraic equations, developed for solving systems with structured matrices (both dense and sparse). The advantages of HSS compression are also described -lower computational complexity for problems with suitable matrix structure.
Section 3 presents numerical results for laminar flow around Joukowsky airfoils. The arising system has a dense matrix and is used as a benchmark for the comparative analysis of the studied tile algorithms.
In Section 4 we examine a two dimensional anomalous diffusion problem modeled with the fractional Laplace operator. Finite element method is utilized for the discretization in time.
Section 5 examines a parabolic in time problem for two dimensional anomalous diffusion.
Section Conclusion presents the concluding remarks summarizing the results obtained in this work. equations with dense systems Many problems in the calculation practice are solved numerically by reduction into a system of linear algebraic equations. For example, when utilizing the boundary element method or the finite element method on a fractional power Laplace operator (fractional diffusion), the arising system has a dense coefficient matrix.

Direct methods
Gaussian elimination is an universal method for solving systems of linear algebraic equations. It is the basis of most direct methods. For example, the LU factorization is based on the sequential elimination of unknowns in the Gauss method. The LU factorization is a primary method, implemented in the high performance software libraries for computational linear algebra.

Gauss method
The Gauss method for solving systems of linear algebraic equations Ax = b involves two parts: (i) Forward elimination. Applies elementary row operations to transform the system into one with an Upper triangular matrix; (ii) Back substitution. Recursively in reverse order the off-diagonal elements in the i-th row of the matrix are eliminated for i = n − 1, n − 2, . . . , 1.
The computational complexity of the Gauss method is determined by the forward elimination [1]:

LU factorization
The LU factorization is expressing the matrix A as a product of two triangle matrices A = LU . Here L is a lower triangular matrix with ones on the main diagonal, and U is an upper triangular matrix. This factorization is calculated with a modified Gaussian elimination and is used in high performance libraries (LAPACK, MKL, ACML, PLASMA, ATLAS, etc.) for solving systems of linear algebraic equations.
The forward elimination can be written as where L 1 , L 2 , . . . , L n−1 are lower triangular matrices with ones on the main diagonal. We can check thatL and L =L −1 are also lower triangular matrices with ones on the main diagonal. Thus: After factorization of A, the system of linear algebraic equations is reduced to solving two systems with triangular matrices. We denote afterwards we will: 1. Solve the system Ly = b with forward substitution; 2. Solve the system U y = y with backward substitution; The numerical complexity of the factorization is O 2 3 n 3 , while the forward and backward substitutions have -O n 2 .

Hierarchical matrices. Methods for solving systems of linear algebraic equations with hierarchical semi-separable compression
Hierarchical semi-separable matrices is applied for the approximation of data-sparse matrices. By data-sparse we understand such matrices that have structure that allows approximation by compressed matrices, which could be expressed with lower amount of elements. In the general case the data-sparse matrices do not satisfy the condition to have O(n) nonzero elements. Hackbusch introduced the term Hierarchical matrices in [3] by developing the theory and algorithms for the so called H-matrices.
Methods utilizing hierarchical matrices are a part of the wider group of methods for solving systems with the so called structured matrices. An overview of the existing methods for such methods can be found inside [28], including hierarchically semi-separable matrices. in this work we will examine the efficiency of algorithms based on this class of methods.
STRUMPACK (STRUctured Matrices PACKage) is a parallel software library, that implements hierarchical semi-separable compression for solving dense systems of linear algebraic equations [29]. The algorithm involves three steps: 1. Hierarchically semi-separable compression (approximation) of the matrix of the system. When certain conditions are satisfied this step has computational complexity of O(r 2 n), where r is the maximum rank of the off-diagonal blocks of the approximating matrix. It is calculated during the compression. In the general case the complexity is O(rn 2 ).
one described in the LU factorization above. First O(r) unknowns are eliminated, then the rest of the O(n − r). Computational complexity of this step is O(r 2 n).

Solution.
In this step the compressed and factorized matrix is used along the right hand side to obtain the solution. Computational complexity is O(rn).
The overall computational complexity of the method is O(r 2 n). As seen later this evaluation is valid for certain assumptions. In the general case the complexity is O(rn 2 ).
Hierarchical Semi-Separable compression. In this subsection we will examine in short the hierarchical semi-separable matrices (HSS). They are first introduced by Martinsson in [13]. Algorithms implementing it are described in [29] as a part of the STRUMPACK project, for solving systems of linear algebraic equations with dense matrices. Hierarchical compression may be applied to any non-singular matrix, however it is effective only if the original matrix A has suitable structure -meaning that its off-diagonal blocks have low rank. By effective compression we mean one that approximates the matrix, such that the memory needed to store it is much lower and we can apply transformations on the compressed matrix with lower computational complexity.
We will denote the compress matrix of A with H. The algorithm could be written as: 1. We divide the matrix A in four blocks. We assume that the off-diagonal blocks have low rank (and Singular Value Decomposition, or another rank calculating factorization, can be applied to them): The matrices U , B and V are called generators. If the off-diagonal blocks have low rank, U will be "tall and slim", B will be small and square (or close to) and V will be "short and wide". The relation between rows and columns will depend on the rank r. D are unchanged diagonal blocks. The "big" notation will be described below.
2. Assuming, that the diagonal blocks D also have off-diagonal low rank blocks, they are compressed in the same fashion, the process continues recursively. The second level of recursive compression can be written as: There is a recursive property between the generators of different levels of compression. This is denoted with the "big" notation. The following relations apply The third level of recursive HSS compression can be written as: Generators with the notation "big" can be computed outside of the highest levels of recursive compression. U can be computed from the U τ and U big at the higher level of compression. At the last level U = U big .
In the general case equation 2 is approximate. This means that as a result we obtain an approximation of A, that we will denote as H ≈ A.
To achieve this a suitable threshold ε must be supplied. It is necessary for the calculation of the generators. When a larger threshold is used, the generators are smaller and the compressed matrix takes less memory and allows more effective arithmetic operations with it, but this is at the price of lower accuracy. If a smaller threshold is chosen it is exactly the opposite.
As we will show in the later sections, the ordering of the unknowns while assembling the matrix A is pivotal for the effectiveness of the HSS compression. If A is scrambled randomly, that will almost assuredly destroy any suitable for this method structure.
For certain classes of problems it is possible to reorder the unknowns in such a way, that the arising structure of the matrix is improved. For example, in [30], several methods for clusterization are studied for Kernel Ridge Regression. In Section 4 we will propose and analyze several methods for reordering of the unknowns for a system of linear algebraic equations, arising from the discretization of an elliptic problem with a fractional power of the Laplace operator (fractional diffusion problem). Compression with randomized sampling. The HSS algorithm deployed in STRUMPACK is based on applying randomized sampling, which utilizes multiplication of random vectors with the original matrix A. This method is first proposed by Martinsson in [13]. Instead of the explicit form of A a function that accesses (or calculates on demand) and another one that calculates a product of A with a vector can be used. The advantages of this approach, as well as an adaptive random sampling algorithm are studied by Gorman et al. in [31]. The Randomized sampling is also useful when integrating HSS kernels in sparse solvers [32].
In the general case the computational complexity of a matrix-vector product is O(n 2 ). This leads to O(rn 2 ) for the HSS compression algorithm. For certain classes of problems r is much smaller than r. For example, for 2D Poisson problems (FEM) r is a constant, while for 3D Helmholtz (BEM) it rises slowly with n. If a fast algorithm (O(n)) for multiplying the compressed matrix with a vector is supplied, the complexity of the compression can be lowered to O(r 2 n).
ULV-like factorization and solution. The compressed matrix H in HSS form can be factorized with a special form of LU factorization known as ULV factorization [33]. This factorization uses orthogonal transformations to first eliminate n − r. The rest r unknowns are then eliminated with LU factorization. The factorization employed inside STRUMPACK is realized as ULV-like factorization, that instead of using orthogonal transformations utilizes the HSS structure of the compressed matrix H. The complexity of this step is O(r 2 n) This process is visualized in Figure 1.
After applying the ULV-like factorization, the system of linear algebraic equations Ax = b is reduced to solving two systems with triangular matrices. The computational complexity of this step is O(rn) [29].

Boundary Element Method for numerical solution of a two dimensional flow around airfoils problem
This section studies a numerical method for a computational simulation of laminar flow around Joukowsky airfoils. This work employs the method described in [34]. We have developed an implementation for a cascade of airfoils in ideal fluid. The method is based on spline collocation with interpolation in parts. A parallel C program is developed in [23] for this problem.
After discretization of the integral equation with the boundary element method the problem is reduced to a dense system of linear algebraic equations. The results from the studied in the paper hierarchical method are compared with results obtained with Gaussian elimination, implemented in several popular software packages. On the CPUs we compare the performance of Intel Math Kernel Library (MKL) and Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA), while on the Intel Xeon Phi coprocessors (abbreviated as MICs from the Many Integrated Core architecture) the performance of MKL is compared with Matrix Algebra on GPU and Multicore Architectures (MAGMA) for MIC architecture (abbreviated as MAGMA MIC).

Boundary Element Method calculating the flow function in an ideal fluid in unbounded two dimensional domain
Let Ω ⊂ R 2 be an unbounded multi-connected domain with a smooth enough internal border S. The flow function Ψ satisfies the Laplace equation: in Ω ⊂ R 2 and can be written as where r 2 (P, Q) = (x − ξ) 2 + (y − η) 2 , P = (x, y), Q = (ξ, η) and dσ Q is a measure on S. The first term on the right hand side represents a simple layer (stream function of the vortex layer), where γ(σ) is the density of the layer, Ψ ∞ is a harmonic function added to the potential of the layer in order to satisfy the condition at infinity for external boundary problems. The velocity field

Fluid flow around airfoils
In this section we will examine the problem of fluid flow around Joukowsky airfoils. We assume that the flow at infinity has homogeneous velocity − → C ∞ = (1, 0). Here S denotes the contours of the airfoils. For the examined problem the flow function Ψ satisfies the Laplace equations (3). The airfoils S are impermeable, therefore the following boundary condition is in force Ψ| S = K = const. In order to satisfy the conditions at Here, we have Ψ ∞ (P ) = γ ∞ (P ).
The boundary condition takes the form Finally, the Kutta-Joukowsky's condition γ(A) = 0, is used to obtain a unique solution of the boundary value problem, where A are the points on the sharp tip of the airfoils.

Discretization
For the numerical solution of the integral equation (5) we apply the boundary element method. The problem is reduced to a system of linear algebraic equation of the form: (Aγ) s = f (s).
We look for a numerical solution is the Lagrangian basis, corresponding to the discretization mesh S h on the airfoils and γ i = γ h (s i ) , i = 1, . . . , n are the BEM nodal unknowns. The collocation method is applied to the mid points of the boundary elements from S h . Following [23] we obtain the system of linear algebraic equations

Analysis of numerical experiments on computers with shared memory
Computing the matrix D has computational complexity O(n 2 ). The calculation of drag and lift forces have complexity O(n). The focus of this work is on the most computationally complex part -the solving of the system of linear algebraic equations.

LU factorization
CPU processors with shared memory. PLASMA (Parallel Linear Algebra Software for Multicore Architectures) [38] is a software library for solving systems of linear algebraic equations implementing the standard LAPACK library. The effectiveness of PLASMA is based on using highly optimized BLAS library (Basic Linear Algebra Subprograms) in which the basic linear algebra operations are implemented -vector, matrix-vector and matrix-matrix multiplications. For this level of computation we employ the libraries MKL BLAS and ATLAS (Automatically Tuned Linear Algebra Software) BLAS [39].
On Table 1 we present the results of the numerical experiments for solving systems of linear algebraic equations obtained when applying the boundary element method for discretization of a flow around airfoils problem. We compare the sequential and parallel execution times for PLASMA + ATLAS, PLASMA + MKL and MKL solvers for n = 5 000 and n = 40 000 and varying the number of threads used.
The results show good parallel speedup for all tested libraries up to 16 threads. The speedup achieved comes close to 15, which is close to the theoretical maximum of 16.
The parallel effectiveness of PLASMA with MKL and MKL is quite similar, increasing up to 94% for the larger problem. They outperform more than 1.5 times PLASMA with ATLAS. One possible reason for this could be that the gcc compiler doesn't employ vectoring as well as icc.     MIC coprocessors. In this subsection we analyze the parallel effectiveness of the Intel Xeon Phi 7120P accelerators (MICs). The MICs are designed for massive parallelism and vectorization as required in High Performance computing. Every MIC has 61 cores and each core can run 4 threads simultaneously for a total of 244 threads. We use the Offload mode which reserves one of the cores for communication with the CPU, thus we can use up to 60 cores (240 threads).
On Table 2 we present the results from numerical experiments with sizes n = 5 000 and n = 40 000, varying the number of threads from 1 to 240. The results show very good performance of MKL in comparison with MAGMA MIC. This may be because of better communication between the threads in MKL. For the biggest problem we achieve parallel speedup of 32.
On Figure 2 we compare the performance of the studied software libraries for the CPU and MIC architectures. The performance of PLASMA with MKL is better than that of MAGMA MIC. That may be due to better communication.
In conclusion we note that the MKL package has better performance both in the CPUs and the MIC coprocessor. The best time on the coprocessor is 4 times faster than the best on the CPU.

Hierarchical Semi-Separable compression
The Hierarchically Semi-Separable compression, implemented in the software library STRUMPACK is approximate. This means that the compressed matrix H approximates the original matrix A. The user must provide two thresholds -absolute ε abs and relative ε rel [31]. In the results presented on this paper we fix the absolute threshold at ε abs = 10 −8 and vary the relative one ε rel = 10 −2 , 10 −4 , 10 −6 , 10 −8 and 10 −12 .
Comparative analysis of hierarchical and LU factorization on CPUs with shared memory. In this subsection we analyze the performance of the HSS compression method and its software implementation in STRUMPACK in comparison with Gaussian elimination and its best (based on the analysis above) implementation MKL, where we use LU factorization. The sequential For the sequential experiments STRUMPACK is more efficient than the best direct solver employed -MKL. STRUMPACK shows lower parallel speeduparound ∼2 to ∼5. This is due to the more complex recursive structure of the HSS compression.
The accuracy and computational effectiveness of the hierarchical method rely on the max rank of the off-diagonal blocks r, which is in turn dependent on the chosen error thresholds and the structure of the original matrix A. Higher rank r will result in smaller relative error R relative as well as longer solution time and vice versa.
The analysis of the presented results shows that in order to achieve high accuracy for the method we require very small threshold corresponding to higher rank r. In these case HSS compression may not be sufficiently effective.

Parallel scalability on computer systems with distributed memory
This section is concentrated on some specifics and hardships when working with HPC systems with hybrid architecture. Those machines have distributed memory on the server level and shared memory on each server. As in the previous section the comparative analysis includes the parallel libraries MKL and STRUMPACK. In order to solve the system of linear algebraic equations, arising from the BEM discretization of flow around Joukowsky airfoils, we employ one or two servers connected with Ethernet. We analyze the following variants with MKL and STRUMPACK: Sequential; Parallel with 24 OpenMP threads; Parallel with 24 MPI processes on 1 server; Parallel with 48 MPI processes on 2 servers; Hybrid parallelization with 2 MPI processes on 2 servers, each with 24 OpenMP threads.

LU factorization
On Figure 5 we show the execution times for solving the system of linear algebraic equations with MKL. The best times are obtained when using a single server with OpenMP, followed by MPI Figure 5b. This is explained by the slower Ethernet communications when employing more than a single server.

HSS compression
When solving the system with HSS compression, the obtained parallel speedup is lower than when using the direct Gaussian solver Figure 6. This can be explained with the recursive structure of the HSS compression.
In the numerical experiments with the lowest relative threshold ε rel we obtain the best computational time when using MPI on a single server. This could likely be explained by the fact that OpenMP parallelization is implemented later than the MPI optimization.
The behavior of the parallel speedup is drastically changed when employing two servers. This is due to the much slower (relatively) connection between them -1000 Mb Ethernet. We can conclude that effectiveness could significantly

Concluding remarks
Central part in the presented results is given to the examination of the Hierarchically Semi-Separable compression method. The experimental comparative analysis is based on the implementation in the STRUMPACK package. It shows much better performance than the Gaussian solvers, utilizing LU factorization. At the same time the parallel speedup of STRUMPACK is found lacking. This is explained with the more complex hierarchical structure of the algorithm.
The accuracy and computational effectiveness of the HSS compression depends on the choice of the absolute and relative thresholds. These parameters must be determined by the user. The presented analysis shows how to achieve best performance with the given accuracy.

Finite Element Method for numerical solution of a two dimensional fractional diffusion problem
The fractional elliptic in space problems of power α ∈ (0, 1) are utilized in modeling anomalous diffusion problems. The arising boundary problems are non-local and in the general case the numerical solution of such problems is an extremely computationally expensive process. Such models are used in image processing, financial mathematics, electromagnetostatics, peridynamics, modeling flow in porous media and many others.
The presented numerical experiments are for model problems in a square domain. The fractional Laplacian is defined through the Riesz potential. The theoretical basis and the specialized finite element method for its numerical solution can be found in [40] by Acosta et. al. The algorithmic implementation of the method can be found in [2] from the same authors.
The performance of several software packages implementing Gaussian elimination was analyzed in Section 3. Here we will use only the most effective of them: Intel's Math Kernel Library (MKL). In this section we will analyze the performance of the HSS compression based algorithm, implemented in the STRUMPACK library. We will employ several reordering schemes in order to improve the structure of the stiffness matrix.

Problem Statement
The fractional Laplacian can be defined as where P.V. means principal value, d is the number of dimensions, α ∈ (0, 1).
The normalized constant C(d, α) can be written as where Γ is the gamma function.
In this section we utilize the integral definition of the fractional Laplace operator according to the definitions in [2] by Acosta et. al. The examined boundary value problem can be defined as Here Ω ⊂ R is a bounded domain, Ω c is the compliment of Ω in R d and f (x), x ∈ Ω is the right hand side with enough smoothness. The variational formulation is obtained from (9) by multiplying with a test function and integrating by parts. The equation for the weak solution is: find u ∈ H α (Ω), such that The scalar product of u and v can be defined in the Hilbert space H α (Ω) with norm · H α (Ω) = · L 2 (Ω) +|·| H α (Ω). Here |·| H α (Ω) is the Aronszajn-Slobodeckij half norm. u, v H α R d can be written as We will point out that integration is carried out over the whole space R d . Correctness of the variational definition of the problem (10) as well as the existence and uniqueness of the solutionH α (Ω) follow from the Lax-Milgram lemma.

Finite element settings
Let T be an admissible triangulation of the domain Ω containing N T triangular elements. We examine the finite element space V h of continuous linear in part functions over T . Let {ϕ 1 , . . . , ϕ N } ⊂ V h be the Lagrangian nodal basis, corresponding to the internal nodes x 1 , . . . , x N . Then ϕ i (x j ) = δ j i . Let T ∈ T is a an element on the triangulation and let's denote with h T and ρ T the diameter and the inner radius. We also write h = max T ∈T h T . We examine shape-regular triangulations, such that τ > 0 independent of T such that h T ≤ τ ρ T , ∀T ∈ T .  Under these conditions for any α ∈ (0, 1) the discreet analog of the variational problem (10) can be written as Here u h = j u j , ϕ j denotes the numerical FEM solution of the stationary diffusion problem (9).
Solving the discrete variational problem (11) is reduced to solving the system of linear algebraic equations KU = F , where U = (u j ) ∈ R N are the nodal unknowns. The stiffness matrix K is symmetrical and positively defined, therefore the system has a single solution.
In the proposed FEM variant from [40] integration is reduced to a ball domain B ⊃ Ω, such that the distance between the border Ω and the compliment B c is sufficient. This introduces the additional triangulationT A over V \Ω, such that the summary triangulationT = T ∪ T A over B is admissible. An example of such triangulation is shown on Figure 7. The nodes denoted with bold font correspond to the unknowns u j .
Let's denote the elements in the triangulation over B with MT . Then the elements inside the stiffness matrix K may be written as where for the integrals I and J the following equations hold true

Reordering of the unknowns
In Subsection 2.4 we have mentioned that for certain classes of problem it is possible to significantly improve the structure of the matrix by reordering the unknowns. The analysis carried out in this subsection shows that for the fractional diffusion problem this is necessary.
The original ordering of the triangulation nodes is shown on Figure 9a. This ordering is obtained from the MatLab function initmesh when generating the finite element mesh. The structure of the matrix corresponding to this ordering is not suitable for the HSS solver from STRUMPACK.

Reordering by "Y" coordinate -"top"
With this reordering scheme the nodes in Ω are reordered by their "Y" coordinate. The obtained reordering and matrix structure can be seen on Figure 9b.

Reordering by lines -"stripes"
With this reordering scheme the nodes are reordered on horizontal lines. An example of this reordering and matrix structure can be see on Figure 9c.

Reordering on a spiral -"snake"
This algorithm reorders the unknowns on a spiral, similar to a coiled snake. The reordering and the structure of the arising matrix can be seen on Figure 9d.

Reordering with Nested Dissection
The Nested Dissection method applies a "divide and conquer" heuristic for the division of a graph representing a sparse symmetrical matrix. The algorithm is executed in three steps: 1. Construction of an undirected graph corresponding to the triangulation, such that the vertices are the mesh nodes and the edges are the sides of the corresponding triangles.
2. Recursive partitioning of the graphs with separators (small sets of vertices which, when removed, divide the graph in subgraphs).
3. Reordering of the nodes in accordance with the recursive structure: First on subgraphs and then by separators.
The separators for an example square domain can be seen on Figure 8. The reordering and the matrix structure are visualized on Figure 9d.

Reordering with recursive bisection
Recursive Bisection [41] is examined as another method for division of graphs. Unlike the Nested Dissection the graph is divided in two subgraphs by removing edges. This process continues recursively. The subgraphs in each recursive partitioning are numbered sequentially.
Recursive Bisection is used to balance the load on distributed memory HPC machines [41].

Analysis of numerical experiments on computational systems with shared memory
The experimental results presented in this section are obtained on a single server of the AVITOHOL supercomputer. The original ordering is obtained from the mesh generation implemented in the program from [2]. In order to improve the effectiveness of the hierarchical solver from STRUMPACK we analyze several reordering schemes.
The numerical experiments are for a fractional Laplacian with power α = 0.5. Visualization of the numerical solutions for the problem in a square and circle domain are shown on Figure 10. In this paper we show results only for the square domain.
In this section we analyze the performance of the two examined solvers, based on their software implementation: LU factorization from MKL and HSS compression from STRUMPACK.

Square domain
On Figure 11 we present the sequential execution times for solving the system of linear algebraic equations. In most experiments the hierarchical solver shows better times with recursive bisection, followed by the nested dissection, "top" and "stripes". Experiments show that MKL has better performance for all values of the relative threshold except for ε rel = 10 −2 , where the recursive bisection, "stripes" and "top" reorderings show better results. We can also conclude that, except for the results without reordering and with the "snake" algorithm, STRUMPACK shows similar performance to MKL.
Off-diagonal rank. On Figure 12 we present the off diagonal rank r, calculated in the HSS compression step. On Figure 13 we show the relation n/r. The value of r is a measure for the effectiveness of the HSS compression on a given matrix. The smaller the rank is, the more effective the compression is. With the original ordering r has higher ranks that vary between 1/3 and 1/4 of the number of unknowns. The "snake" reordering shows the next highest ranks. The rest of the reorderings have similar values for the ranks with the recursive bisection having the edge on most of the experiments. The role of the maximum off-diagonal rank is visible from the results analyzed in the previous section, shown on Figure 11. Parallel times and speedup. In this subsection we analyze the parallel speedup. The numerical experiments are carried out with increasing problem size n ∈ [2 131, 32 302]. The parallel results obtained with 16 threads are presented in Figure 14. The graphs have similar behavior and for the larger values of n the acceleration becomes linear. This is according to the theoretical assessments. We will also note that for ε rel = 10 −2 and "top" reordering HSS compression has better execution times for the two largest sizes n = 24 892 and 32 302.
In the rest of the case the MKL solver is faster. This is due to the more complex recursive structure of the HSS compression. For larger problems we can expect STRUMPACK to show better times than MKL when using smaller thresholds, too.
Analysis of the accuracy of the HSS based solver. When applying the HSS solver from STRUMPACK we obtain an approximation of the solution. This is because the compressed matrix H is an approximation of the original stiffness matrix K. As with the previous section we will analyze the error of the method by calculating the relative error R relative (7). Again we will utilize the   direct Gaussian solution as reference. The relative errors for the original ordering and the analyzed reorderings is presented on Table 4. For most of the experiments the relative error is similar to the relative threshold ε rel . There are a few exceptions, where the accuracy is lower than expected. For example, when using the "stripes" and "top" reorderings for the the largest problems (n = 32 302, the relative error is substantially larger than ε rel . This is another indicator of the need for suitable reordering methods.
The relative error depends on the effectiveness of the compression. This means that when the maximum off-diagonal rank r is small, the compression is more effective. This leads to smaller computational times, but larger relative error R relative .

Concluding remarks
The experimental comparative analysis is based on the realization of HSS compression and ULV-like factorization in the STRUMPACK software library. The analysis shows the better performance of the sequential algorithms in comparison with tile LU factorization. At the same time the acquired parallel performance and speedup with STRUMPACK are lower, which is explained by the more complex hierarchical and recursive nature of the HSS compression.
The accuracy and computational performance of the HSS compression is highly dependent on the relative ε rel and absolute ε abs thresholds as well as the existence of suitable matrix structure. The observed relative error has similar value to the corresponding relative threshold. In order to improve the structure of the matrix five reordering schemes are studied. The presented numerical analysis shows that the recursive bisection has the best results in most of the experiments.
The structure of the matrix obtained from the fractional diffusion problem is less suitable than the problem investigated in Section 3. This can be explained with the fact that the fractional Laplacian is strongly non-local. The examined reordering schemes significantly improve the effectiveness of STRUMPACK. Despite this the parallel performance of the HSS compression is significantly lower than the direct Gaussian elimination.
One of the advantages of the HSS compression is that when solving sequence of systems of linear algebraic equations, in which the matrix doesn't change, the lower computational complexity of the solve with the factorized matrix O(nr) has advantage over the solving after the LU decomposition -O(n 2 ). Such problem is, for example, the one studied in the next section -a parabolic fractional diffusion with lumped mass matrix.

Finite Element Method for solving two dimensional parabolic fractional diffusion problems
The main interest in this section is solving systems with an already factorized matrix. This step, after HSS compression and ULV-like factorization, has computational complexity O(nr) [33]. For comparison, when applying LU factorization, the solving step takes O(n 2 ) arithmetic operations. With the stationary elliptic problem this step is applied only once and has almost no impact on the performance of the solvers. This is changed when the numerical method is used to solve a sequence of systems of linear algebraic equations. In this case the relative weight of the solving after factorization step is increased substantially. Such problem is the finite element method discretization of a parabolic fractional diffusion problem examined in this section.
In order to generate the matrix of mass an algorithm and a software module are developed. It uses the information for the triangulation geometry T ∈ Ω.
For the discretization in time we will use an implicit Euler scheme with an uniform time step and a lumped matrix of mass.
For the Numerical experiments we use settings analogous to the test example in [42] by Vabishchevich. This allows the numerical results, corresponding to the two methods to be compared. Results presented in this Section are published in [27].

Problem statement
We use the integral representation (9) of the fractional Laplacian. The following parabolic problem with unknown function u( Here [0, T ] is the time interval. The Dirichlet homogeneous boundary conditions are applied as in the previous problem. We apply the same triangulation T ∈ Ω as in the previous section. The following Cauchy problem is obtained for the unknown functions u = (u j (t)) ∈ R N , t ∈ [0, T ] and right hand side f = (f j (t)) ∈ R N . Here K = K ij ∈ R N ×N is the stiffness matrix corresponding to the fractional Laplacian. It has the form defined in (12) and (14). With M L = diag m i L ∈ R N we denote the lumped matrix of mass, where m i L is the concentrated mass at node x i . The algorithm and programming module for the computation of M L can be found in [27].
For the discretization over time we use the implicit Euler method, which in the general case has the form where m is the amount of time steps, m−1 j=0 τ j = T, In this paper we will limit ourselves to the case of a constant time step τ j = τ . Under this condition, each step in (15) comes down to solving the system of linear algebraic equations The mass and stiffness matrices are symmetric and positive definite thus (16) has a unique solution. This means that in the implementation of Euler's method the factorization step is performed once, then we solve m systems with the factorized matrix.
The computational complexity of LU factorization is O n 3 . Then the m steps in time require another O n 2 m arithmetic operations.
In the hierarchical method -HSS compression and ULV-like factorization have a summary complexity of O(rn 2 ) and O(nrm) arithmetic operations are needed to solve the factorized systems. Thus, the effectiveness of the hierarchical method is determined by the max off-diagonal rank r.
Several reordering schemes were introduced in Section 4 for solving the fractional diffusion problem. Here we will show only results for the most effective -recursive bisection.
For the numerical experiments we will use a problem analogous to the parabolic problem studied by Vabishchevich in [42], where the spectral definition of a fractional Laplacian is used. The problem is solved for (x, t) ∈ Ω × [0, T ] = (−1, 1) 2 × (0, 0.1). The solution is determined from the time independent right hand side f (x) = 0, and initial condition u 0 (x) = 100 The discretization in time uses the triangulation from Section 4 with time step τ = T /m, m = 256. The initial condition u 0 (x) and numerical solutions with α = 0.5 for t ∈ {0.025, 0.05, 0.075, 0.1} are shown on Figure 15. The solution obtained is qualitatively similar to the results presented in [42] .

Analysis of numerical experiments on systems with shared memory
The numerical results analyzed in this section are obtained on computer systems with shared memory. As in previous section, the experiment was performed on the AVITOHOL supercomputer. The MatLab program, published in [2], was used to generate the stiffness matrix K and the right part f . For the reordering of the unknowns with recursive bisection we use the algorithm described in Subsection 4.2.5 and the developed code presented in [26]. To calculate the lumped mass matrix M L we use the algorithm presented in [27].
Sequential and parallel experiments. On Figure 16 we present: the execution times of the hierarchical semi-separable compression and ULV-like factorization from STRUMPACK and LU factorization from MKL - (Figures 16a  and 16d); the total time for solving the systems with the factorized matrices of each of the time steps in Euler's method - (Figures 16b and 16e); and the total time - (Figures 16c and 16f). In almost all cases, the numerical experiments show better efficiency of the hierarchical method, and this tends to increase with the number of unknowns n. For the largest system (n = 32 302) the total time with the HSS compression method from the STRUMPACK package are between ∼2.5 and ∼5 times better than when using the LU factorization from the MKL package. These results confirm the theoretical expectations for stronger improvement of the efficiency of the hierarchical solver for the parabolic problem, when compared to the stationary problem discussed in the previous section.  The parallel speed of the HSS based solver from STRUMPACK is shown on Figure 17. The parallel experiments with 16 threads show speedup from ∼3 (for n = 2 131) up to ∼8(for n = 32 302). The lower parallel speedup of the hierarchical method can be explained by the more complex hierarchical and recursive structure of the compression algorithm. The parallel implementation of the solver with HSS factorized matrix also has more complex (and less balanced) structure than the tile LU factorization.
Off-diagonal rank. The maximum off-diagonal rank r, calculated during the HSS compression, is presented on Figure 18a, on Figure 18b the relation n/r is shown. This rank is a measure of the efficiency of the compression, and at the same time is detrimental in estimating the computational complexity. For the examined problem r has much smaller values than n. The compression is efficient, meaning that r/n is small, with the largest values of the relative threshold ε rel . Thus with ε rel = 10 −2 the rank r is between ∼20 and ∼80 times smaller than n, while with the finest threshold ε rel = 10 −8 this relation is between ∼10 and ∼30. This analysis shows that, after reordering with recursive bisection, the matrixK has suitable structure for HSS compression. This is also confirmed by the numerical experiments showcasing the advantage of the hierarchical method over Gaussian elimination (block LU factorization). It is important to note that the compression is approximate. Thus, the higher compression efficiency (obtained with higher threshold values ε rel ) is at the expense of the lower accuracy of the solution.
Analysis of the relative error of the HSS-based solver. The compressed matrix H, obtained after the HSS compression is an approximation ofK. As with the stationary problem in the previous section, we will analyze the relative error R relative (7). The relative errors R relative for four chosen points in time t is presented on Table 5. As with the stationary problem, the relative error is close to the supplied relative threshold ε rel . The presented numerical results also show that R relative doesn't increase substantially with the increase of the time interval. This confirms the stability of Euler's implicit method.

Concluding remarks
The main topic in the presented results is the analysis of the computational efficiency of a method based on Hierarchical Semi-Separable compression and ULV-like factorization and their parallel implementation in the STRUMPACK software package. One peculiarity of the applied implicit Euler method with a constant step τ is that the numerical solution of the parabolic problem is reduced to a sequence of m = T /τ systems of linear algebraic equations with the same matrixK, only changing the right hand side at each time step. This means that the matrixK is factorized only once and the accent falls on the solution of m systems with a factorized matrix. Recall that this step has a computational complexity O(n 2 ) for the Gaussian solver and O(nr) for the hierarchical solver.
The presented analysis shows that, for the considered problem, the rank r is significantly smaller than the number of unknowns n, which determines the advantage of the hierarchical method. As a result, both sequential and parallel solution times utilizing the solver from the STRUMPACK software package are significantly better than the execution times with the LU factorization from MKL.
The relative error R relative is close to the examined values for the relative threshold, growing steadily with the development of the process over time. This confirms that the hierarchical method provides good accuracy with suitably chosen ε rel . The presented numerical results confirm the significant advantage of the Hierarchical Semi-Separable compression method from the STRUMPACK library.

Conclusion
In this paper we analyze the computational efficiency of numerical methods and algorithms for solving systems of linear algebraic equations with dense matrices. The motivation for this study are applications related to the numerical solution of elliptical and parabolic partial differential equations. Two such problems are used in the presented comparative analysis: a) boundary value problem describing laminar flow around Joukowsky airfoils, discretized with the Boundary Elements Method; b) anomalous diffusion inside a bounded domain modeled with the fractional Laplacian, where the finite element method is applied for the discretization. In both cases the problems are reduced to systems of linear algebraic equations with dense matrices. It is shown that the structure of these matrices is suitable for applying a hierarchical method using HSS compression.
An important part of Section 3 is the comparative analysis of the computational complexity of software packages implementing tile LU factorization, a variant of Gaussian elimination. The general conclusion is that the MKL library has better performance than the examined alternative implementations of LU factorization.
The accent in this work is the analysis of the possibility for improvement of the computational efficiency of solving systems of linear algebraic equations with dense matrices with the help of an hierarchical method utilizing HSS compression. This method is implemented in the STRUMPACK software package. In Sections 3 and 4 we examine the performance of the hierarchical algorithm for systems of linear algebraic equations obtained from the application of the boundary element and finite element methods, respectively, for the considered elliptic boundary problems. The analysis shows that the studied dense matrices have a suitable structure for the application of the hierarchical method. This means that the HSS compression finds low-rank off-diagonal blocks.
The sequential experiments affirm the computational complexity measures of the analyzed tile methods. For both problems the hierarchical solver shows better performance than the MKL Gaussian elimination algorithm -the most efficient from the studied software packages implementing LU factorization.
When applying the hierarchical method we obtain an approximate solution of the system. Its accuracy depends on the accuracy of the HSS compression. We base the analysis of the relative error R relative of the numerical experiments on using the solution obtained from LU factorization as reference. For the fractional diffusion problem the relative error is observed to be close to the set threshold ε rel . This can be accepted as a good characteristic for the HSS compression solver. For the flow around Joukowsky airfoils problem the relative error for the corresponding ε rel is larger, but this is compensated with better execution time performance.
It is well known that the quality of the Hierarchical Semi-Separable compression depends heavily on the structure of the dense matrix. With the two dimensional fractional diffusion problem the structure of the originally obtained dense matrix is not suitable for HSS compression. In order to improve it five reordering schemes are applied. The presented analysis shows the advantages of using Nested Dissection and Recursive Bisection.
In Section 5 we study the computational performance and accuracy of the hierarchical solver based on HSS compression for a parabolic problem with fractional in space diffusion. The discretization in time is carried out with an implicit Euler differential scheme with uniform step. Finding the numerical solution of this problem is reduced to a sequence of systems of linear algebraic equations with the same matrix. In this way on every time step we solve a system that is factorized once only. Solving such systems with HSS compression has lower computational complexity -O(nr) in comparison to the LU factorization's O(n 2 ). For the examined parabolic problem the hierarchical solver's execution times are better both for the sequential and parallel experiments. At the same time, thanks to the unconditional stability of the implicit Euler method, the relative error is close to the set relative threshold ε rel .