\( \newcommand{\Acal}{\mathcal{A}} \newcommand{\rr}{\mathbb{R}} \newcommand{\cc}{\mathbb{C}} \newcommand{\nn}{\mathbb{N}} \newcommand{\BigO}{\mathcal{O}} \newcommand{\Hmat}{\mathcal{H}} \newcommand{\Tcal}{\mathcal{T}} \newcommand{\Ical}{\mathcal{I}} \newcommand{\Lcal}{\mathcal{L}} \newcommand{\Ccal}{\mathcal{C}} \newcommand{\Qcal}{\mathcal{Q}} \newcommand{\Kcal}{\mathcal{K}} \newcommand{\Wcal}{\mathcal{W}} \newcommand{\Vcal}{\mathcal{V}} \newcommand{\Ucal}{\mathcal{U}} \newcommand{\Ecal}{\mathcal{E}} \newcommand{\Scal}{\mathcal{S}} \newcommand{\Bcal}{\mathcal{B}} \newcommand{\Hcal}{\mathcal{H}} \newcommand{\Ncal}{\mathcal{N}} \newcommand{\HMAT}{\mathcal{H}-\text{matrix}} \newcommand{\HMATS}{\mathcal{H}-\text{matrices}} \newcommand{\HoneMAT}{\mathcal{H}^1-\text{matrix}} \newcommand{\HoneMATS}{\mathcal{H}^1-\text{matrices}} \newcommand{\HtwoMAT}{\mathcal{H}^2-\text{matrix}} \newcommand{\HtwoMATS}{\mathcal{H}^2-\text{matrices}} \newcommand{\Hone}{\mathcal{H}^1} \newcommand{\Htwo}{\mathcal{H}^2} \newcommand{\dist}[1]{\text{dist}(#1)} \newcommand{\diam}[1]{\text{diam}(#1)} \newcommand{\child}[1]{\text{child}(#1)} \newcommand{\level}[1]{\text{level}(#1)} \newcommand{\depth}[1]{\text{depth}(#1)} \)

Portable implementation of a computation kernel for the N-body problem with Kokkos


1. Foreword

This document summarizes the work completed during Aurélien Gauthier's internship with the CONCACE team, under the supervision of Antoine Gicquel, PhD student in CONCACE. The objective was to explore the Kokkos ecosystem by attempting to implement existing kernels from the ScalFMM library. This document serves as a reflection on that experience and includes code snippets from the developed implementations to highlight some of the features offered by Kokkos. However, note that it is not intended to be a formal guide or comprehensive summary of the official documentation (which can be found here), though pointers to relevant sections are provided. We only explored specific features of Kokkos that suited our needs for a specific target application (detailed in the introduction), and thus this document does not aim to be an exhaustive overview of the library. Nonetheless, we hope that our experience might be valuable to others interested in learning more about Kokkos. At the time of writing, the latest release of Kokkos is version 4.3.01; please note that some features presented here may change as the API evolves. Additionally, certain results are preliminary and may require further investigation (see the results section). Finally, feedback is warmly welcome!

2. Introduction

Before diving into the heart of the matter, let’s briefly introduce the context of the internship.

2.1. Context

We are interested in physical applications where we evaluate pairwise interactions between \(N\) bodies (by "bodies" we mean particles, planets, celestial objects, etc.). Each of the \(N\) bodies is located at a position \(x_i \in \mathbb{R}^n\) (\(n=2\) or \(n=3\) most of the time). Generally, this can be expressed as follows:

\begin{equation} \Phi_i = \sum_{j = 1, i \neq j}^{N}{ k(x_i, x_j)m_j} \quad \forall i \in \{1,\dots,N\}. \label{eq:n-body} \end{equation}

In equation \eqref{eq:n-body}:

  • \(x_i\) is the position of body \(i\).
  • \(m_i\) represents the input for body \(i\) (for example, it could be an electric charge or mass).
  • \(\Phi_i\) represents the output for body \(i\) (for example, it could be a potential or a force).
  • \(k(x,y)\) is the interaction function (or interaction kernel). This function often has a singularity when \(x\) and \(y\) are close to each other, but it becomes smooth and decays rapidly as the distance from the singularity increases.

We will now provide some examples of this physical problem.

2.1.1. Electrostatic potential

In this case, we consider a set of \(N\) particles, each with a charge \(q_i\), and we want to compute the potential \(p_i\) for each one of them using the following expression:

\begin{equation} p_i = \frac{1}{4\pi \epsilon_0} \sum_{j = 1, i \neq j}^{N}{ \frac{q_j}{\|x_i - x_j\|} } \quad \forall i \in \{1,\dots,N\}, \label{eq:potential} \end{equation}

where \(\epsilon_0\) is the vacuum permittivity. Thus, the interaction function is defined as:

\begin{equation*} k(x,y) = \frac{1}{\|x - y\|}. \label{eq:potential-kernel} \end{equation*}

2.1.2. Gravitational interactions

In this case, we have a set of \(N\) celestial bodies (planets, stars, etc…) each with a mass \(m_i\), and we aim to calculate the total force \(\vec{F}_i\) acting on each one of them using the following formula:

\begin{equation} \vec{F}_i = G m_i \sum_{j = 1, i \neq j}^{N}{ \frac{(x_j - x_i)}{\|x_i - x_j\|^3} m_j } \quad \forall i \in \{1,\dots,N\}, \label{eq:force} \end{equation}

where \(G\) is the gravitational constant. Thus, the interaction function is defined as:

\begin{equation*} k(x,y) = \frac{y - x}{\|x - y\|^3}. \label{eq:force-kernel} \end{equation*}

2.2. The Fast Multipole Method

One can notice that the naive direct evaluation of \eqref{eq:n-body} has quadratic complexity (i.e., \(\BigO(N^2)\)). The Fast Multipole Method (FMM) (Greengard and Rokhlin 1987) is a hierarchical method used to accelerate the evaluation of \eqref{eq:n-body}. It specifically speeds up the computation of long-range interactions (i.e., the distant interactions) between a large number of bodies. Instead of calculating all pairwise interactions by evaluating \eqref{eq:n-body}, which would obviously be prohibitively time-consuming, the FMM groups distant bodies and approximates their interactions. This significantly reduces the computation time. For a more comprehensive, in-depth explanation of the FMM, we direct the reader to the first chapter of (Blanchard 2017). However, a thorough understanding of the FMM is not required to follow this document; this reference is provided simply to give the overall context.

In the FMM algorithm, the space in which the \(N\) bodies are located is hierarchically divided into clusters (or boxes) that are nested within one another. This hierarchical division is usually represented by a tree, where each node represents a box (this is a \(2^d\)-tree: an octree in 3D, a quadtree in 2D, and so on). We usually consider that interactions between bodies located in the same box or in neighboring boxes are part of the near field, while other interactions are part of the far field.

In practice, the FMM separates the computation of the near field from the far field. As briefly mentioned above, the FMM groups distant bodies and uses approximations to compute the far field. However, for the near field calculation, approximations are not accurate enough. As a consequence, the interactions between each pair of bodies must be computed explicitly and precisely (like in expression \eqref{eq:n-body}). This is the purpose of the P2P operator, which we will describe in the next paragraph. In the end, the FMM algorithm performs a fast evaluation of \eqref{eq:n-body} (by "fast evaluation" we mean \(\BigO(N \log{N})\) and even \(\BigO(N)\) with a multi-level strategy).

2.3. The P2P kernel

As we mentioned earlier, the P2P (Particle-to-Particle) kernel is the part of the FMM algorithm responsible for the direct calculation of interactions between nearby bodies.

There are actually two variants of the P2P kernel, but the underlying concept is the same for both: iterating through bodies to update the output of others. In this context, we refer to the bodies whose outputs are being updated as target bodies, and those contributing to the interactions as source bodies. Note that, in most cases, the target and source bodies are actually the same.

  • First version: The P2P inner kernel (see Figure 1) calculates the interactions between bodies located within the same box \(B\). We can notice that we skip the case where \(i = j\) (i.e., self-interaction). This situation corresponds to the one illustrated in Figure 2.

p2p-inner-algo.svg

Figure 1: P2P inner algorithm

p2p-inner-box.svg

Figure 2: Bodies in a single box (typical situation for the P2P inner kernel)

  • Second version: The P2P outer kernel (see Figure 3) calculates the interactions between bodies located in two neighboring boxes. We consider a first box \(B_1\) with \(M\) bodies and a second box \(B_2\) with \(N\) other bodies. Each of the bodies in \(B_1\) contributes to updating the output of the bodies in \(B_2\). This situation corresponds to the one illustrated in Figure 4 (a minimal sketch of this variant is given after the figures below).

p2p-outer-algo.svg

Figure 3: P2P outer algorithm

p2p-outer-box.svg

Figure 4: Bodies in neighboring boxes (typical situation for the P2P outer kernel)
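Since the P2P outer algorithm is only shown as a figure above, here is a minimal, non-optimized sketch of the idea in the two-dimensional electrostatic setting used in the rest of this document. The function name and the use of std::vector are ours and purely illustrative; this is not code extracted from ScalFMM.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the P2P outer kernel: every body of box B1 (sources)
// contributes to the potential of every body of box B2 (targets).
void p2p_outer_sketch(std::vector<double> const& source_x, std::vector<double> const& source_y,
                      std::vector<double> const& source_charge, std::vector<double> const& target_x,
                      std::vector<double> const& target_y, std::vector<double>& target_potential)
{
    for(std::size_t i = 0; i < target_x.size(); ++i)   // targets in B2
    {
        double result = 0.;
        for(std::size_t j = 0; j < source_x.size(); ++j)   // sources in B1
        {
            const double diff_x = source_x[j] - target_x[i];
            const double diff_y = source_y[j] - target_y[i];
            result += source_charge[j] / std::sqrt(diff_x * diff_x + diff_y * diff_y);
        }
        // no self-interaction to skip here: B1 and B2 are disjoint boxes
        target_potential[i] += result;
    }
}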

From now on, without loss of generality, we will focus only on the P2P inner kernel in the case of the electrostatic potential (see \eqref{eq:potential}) in a simple two-dimensional application (i.e., \(n = 2\)). In other words, we aim to compute:

\begin{equation} p_i = \sum_{j = 1, j \neq i}^{N}{ \frac{q_j}{\|x_i - x_j\|} } \quad \forall i \in \{1,\dots,N\} \qquad \text{ with } x_i \in \mathbb{R}^2, \label{eq:potential-electro} \end{equation}

with \(p_i\) the potential of particle \(i\) and \(q_j\) the charge of particle \(j\). This was the target kernel that we attempted to implement using Kokkos.

2.4. Kokkos in a nutshell

Before moving forward, let's provide a brief introduction to Kokkos (Trott et al. 2022; Edwards, Trott, and Sunderland 2014). It is most often presented as a C++ library designed for performance portability in HPC applications. It allows developers to write parallel code that runs efficiently across various hardware architectures, including CPUs and GPUs, without the need to rewrite code for each platform. By providing abstractions for parallel execution and memory management, Kokkos allows users to leverage different backends (e.g., OpenMP, CUDA) through a single, unified API. However, it is important to note that these backends must be selected at compile time. During the internship, we tested the following backends: OpenMP (CPU), C++ threads (CPU), and CUDA (GPU). Other backends are also supported (such as HIP), and more are planned for the future. Last but not least, Kokkos relies on key abstractions, such as the memory space, which determines where the data is located, and the execution space, which specifies where the code is executed.
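To make these abstractions concrete, here is a minimal Kokkos program sketch (our own toy example, not taken from the internship code): it starts the runtime, allocates two views in the default memory space, and updates one of them in parallel in the default execution space, whatever backend was selected at compile time.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);   // start the backend(s) selected at compile time
    {
        const int n = 1000;
        // views are allocated in the default memory space of the chosen backend
        Kokkos::View<double*> x("x", n), y("y", n);

        // the lambda body runs in the default execution space (CPU threads, GPU, ...)
        Kokkos::parallel_for(
          "axpy", n, KOKKOS_LAMBDA(const int i) { y(i) += 2.0 * x(i); });
    }
    Kokkos::finalize();   // all views must be destroyed before this call, hence the scope
    return 0;
}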

The most convenient way to install Kokkos is via CMake (but you can find other alternatives in the documentation). However, Guix also offers an easy way to switch between different backends, making it simple to configure Kokkos for various architectures. For example, the following lines can be used to create development environments via guix shell using the --pure option:

  • Using OpenMP as a backend:
guix shell --pure kokkos-openmp gcc-toolchain@11 <other packages> ...
  • Using C++ threads as a backend:
guix shell --pure kokkos-threads gcc-toolchain@11 <other packages> ...
  • Using CUDA as a backend:
guix shell --pure kokkos-cuda-a100 gcc-toolchain@11 cuda-toolkit@12.4 <other packages> ...
export LD_PRELOAD=/usr/lib64/libcuda.so:/usr/lib64/libnvidia-ptxjitcompiler.so # required to use CUDA with guix

We provide suitable Guix environments to run the implementations with these backends in the results section.

As mentioned earlier, Kokkos is written in modern C++ and makes extensive use of advanced C++ programming techniques. However, it offers a straightforward API, so users do not need to engage with complex C++ code to work with Kokkos. That said, it is still beneficial to be familiar with C++ templates. Roughly speaking, templates allow for writing generic and reusable code: functions and classes are defined independently of specific types, so the same code can operate on various data types without duplication.
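For readers less familiar with templates, here is a tiny example (unrelated to Kokkos): the same function is written once and the compiler instantiates a float version and a double version at the call sites.

#include <iostream>

// generic squared distance in 2D: the type T is deduced at the call site
template<typename T>
T squared_distance(T x1, T y1, T x2, T y2)
{
    const T diff_x = x2 - x1, diff_y = y2 - y1;
    return diff_x * diff_x + diff_y * diff_y;
}

int main()
{
    std::cout << squared_distance(0.f, 0.f, 3.f, 4.f) << '\n';   // instantiated for float
    std::cout << squared_distance(0.0, 0.0, 3.0, 4.0) << '\n';   // instantiated for double
    return 0;
}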

3. Source code

3.1. Baseline implementation

We begin by introducing the naive sequential implementation of the P2P inner kernel. This version serves as a baseline for the experiments and is only used to verify the numerical results. The implementation is quite basic, leaving significant room for optimization. Indeed, we can notice for example that it does not take advantage of the computation's inherent symmetry. In this approach, we simply iterate over the target particles and update their potential by calculating interactions with all source particles, while skipping self-interactions (we remind the reader that, in that case, the target and source particles are actually the same).

#pragma once

#include "utils.hpp"

#include <cmath>
#include <concepts>
#include <iostream>
#include <vector>

/**
 * @brief Naive version of the P2P inner kernel.
 *
 * This function calculates pairwise interactions (potential in this case) between particles based on
 * their positions and charges. It iterates over all target particles and computes their interaction
 * with all source particles, except themselves, and stores the result in the `potential` array.
 *
 * @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
 *         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
 *
 * @param[in] position_x Container of x-coordinates of the particles.
 * @param[in] position_y Container of y-coordinates of the particles.
 * @param[in] charge Container of charge values associated with each particle.
 * @param[out] potential Output container where the computed potential for each particle is stored.
 *
 */
template<Container ContainerType>
auto simplified_p2p_inner(ContainerType const& position_x, ContainerType const& position_y, ContainerType const& charge,
                          ContainerType& potential) -> void
{
    // set-up (aliases)
    using container_type = ContainerType;
    using value_type = typename container_type::value_type;

    const std::size_t size = position_x.size();
    value_type result;

    // loop through the target data
    for(std::size_t i = 0; i < size; ++i)
    {
        const value_type local_x_target = position_x[i], local_y_target = position_y[i];
        result = 0.;
        // loop through the source data (we skip the case where 'i == j')
        for(std::size_t j = 0; j < i; ++j)
        {
            const value_type diff_y = position_y[j] - local_y_target, diff_x = position_x[j] - local_x_target;
            result += charge[j] / std::sqrt(diff_y * diff_y + diff_x * diff_x);
        }
        for(std::size_t j = i + 1; j < size; ++j)
        {
            const value_type diff_y = position_y[j] - local_y_target, diff_x = position_x[j] - local_x_target;
            result += charge[j] / std::sqrt(diff_y * diff_y + diff_x * diff_x);
        }

        // update potential
        potential[i] = result;
    }
}
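As noted above, this baseline ignores the symmetry of the interaction: for a pair \((i, j)\), the same distance yields a contribution \(q_j/r\) to \(p_i\) and \(q_i/r\) to \(p_j\). A possible (still sequential, not benchmarked here) variant exploiting this symmetry is sketched below; it reuses the same Container concept and headers as the baseline and roughly halves the number of square roots.

template<Container ContainerType>
auto simplified_p2p_inner_symmetric(ContainerType const& position_x, ContainerType const& position_y,
                                    ContainerType const& charge, ContainerType& potential) -> void
{
    using value_type = typename ContainerType::value_type;
    const std::size_t size = position_x.size();

    // the output must start from zero since both sides of each pair are accumulated
    for(std::size_t i = 0; i < size; ++i)
    {
        potential[i] = value_type(0.);
    }

    // each pair (i, j) with j > i is visited once and contributes to both potentials
    for(std::size_t i = 0; i < size; ++i)
    {
        for(std::size_t j = i + 1; j < size; ++j)
        {
            const value_type diff_x = position_x[j] - position_x[i], diff_y = position_y[j] - position_y[i];
            const value_type inv_distance = value_type(1.) / std::sqrt(diff_x * diff_x + diff_y * diff_y);
            potential[i] += charge[j] * inv_distance;
            potential[j] += charge[i] * inv_distance;
        }
    }
}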

3.2. Existing implementations

Prior to the internship, there were existing implementations of the P2P inner kernel, which were originally more generic but had been adapted to the simplified two-dimensional case. These included a CPU vectorized version using xsimd and a CUDA version. We begin by presenting these two versions, as they will serve as a baseline for comparison with the Kokkos-based implementations.

3.2.1. Implementation on CPU using xsimd

This implementation is a simplified version of a kernel extracted from ScalFMM 3.0 and was adapted for this simplified context (the two-dimensional case). It leverages the xsimd C++ library which provides wrappers for SIMD intrinsics. Roughly speaking, the xsimd library automatically detects and uses the available SIMD instruction sets on the target machine, for instance SSE, AVX, AVX2, and AVX512 on x86 architectures, and NEON on ARM architectures (an exhaustive list of the supported instruction sets is given in their documentation). One of the key benefits of using xsimd is its portability; it offers a high-level API that can be used across different CPU architectures, simplifying the process of writing SIMD-optimized code without having to handle specific low-level instruction sets (have a look at the Intel intrinsics guide to see how cumbersome it can be to use them manually, especially for beginners). This library relies on SIMD instruction detection at compile time, allowing the same code to adapt to different target architectures. Another consequence is that these wrappers don't introduce any runtime overhead compared to the use of raw intrinsics. Additionally, xsimd is tightly integrated with xtensor, a NumPy-like multidimensional array library for C++, which provides excellent tools for scientific computing. In the implementation below, we vectorize the first part of the outermost loop (over the target data) by using SIMD batches to process multiple particles simultaneously, while the remaining part of the loop is treated in the "standard scalar manner". To remove self-interactions, we use SIMD batch logical masks, which act as filters to exclude them efficiently.

#pragma once

#include "utils.hpp"

#include <array>
#include <cmath>
#include <concepts>
#include <iostream>
#include <limits>
#include <numeric>
#include <type_traits>

#include <xsimd/xsimd.hpp>

// Helper function to print the SIMD instruction set being used
inline void print_simd_instruction_set()
{
    std::cout << "- SIMD instruction set: ";
#if XSIMD_WITH_AVX512F
    std::cout << "AVX-512" << '\n';
#elif XSIMD_WITH_AVX2
    std::cout << "AVX2" << '\n';
#elif XSIMD_WITH_AVX
    std::cout << "AVX" << '\n';
#elif XSIMD_WITH_SSE4_2
    std::cout << "SSE4.2" << '\n';
#elif XSIMD_WITH_SSE4_1
    std::cout << "SSE4.1" << '\n';
#elif XSIMD_WITH_SSE2
    std::cout << "SSE2" << '\n';
#elif XSIMD_WITH_NEON
    std::cout << "NEON" << '\n';
#else
    std::cout << "Unknown SIMD instruction set \n";
#endif
}

/**
* @brief P2P inner kernel using SIMD vectorization via xsimd.
*
* This function calculates pairwise interactions (potential in this case) between particles based
* on their positions and charges. It leverages SIMD vectorization via the **xsimd** library to
* process multiple particles simultaneously. The function uses SIMD to handle large portions of the
* data in parallel, while a scalar fallback handles any remaining particles that don't fit into a full
* SIMD batch. We use SIMD batch logical masks to remove self-interactions (acting as a filter).
*
* @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
*         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
*
* @param[in] position_x Container of x-coordinates of the particles.
* @param[in] position_y Container of y-coordinates of the particles.
* @param[in] charge Container of charge values associated with each particle.
* @param[out] potential Output container where the computed potential for each particle is stored.
*
*/
template<Container ContainerType>
auto p2p_inner_xsimd(ContainerType const& position_x, ContainerType const& position_y, ContainerType const& charge,
                     ContainerType& potential) -> void
{
    // Set-up (aliases)
    using container_type = ContainerType;
    using value_type = typename container_type::value_type;
    using batch_type = xsimd::batch<value_type>;
    static constexpr std::size_t batch_size = batch_type::size;

    const std::size_t size = position_x.size();
    const std::size_t vec_size = size - size % batch_size;

    const value_type local_tolerance{std::numeric_limits<value_type>::epsilon()};

    batch_type vec_contribution;

    // Initialize logical mask
    std::array<value_type, batch_size> next_target_indices;
    std::iota(next_target_indices.begin(), next_target_indices.end(), value_type(0.));
    auto vec_next_target_indices = xsimd::load_unaligned(next_target_indices.data());

    // Loop through target data (vectorized part)
    for(std::size_t i = 0; i < vec_size; i += batch_size, vec_next_target_indices += batch_size)
    {
        const batch_type vec_local_target_x = xsimd::load_unaligned(&position_x[i]);
        const batch_type vec_local_target_y = xsimd::load_unaligned(&position_y[i]);

        vec_contribution = 0.;

        // Loop through source data
        for(std::size_t j = 0; j < size; ++j)
        {
            const batch_type vec_diff_x = vec_local_target_x - position_x[j],
                             vec_diff_y = vec_local_target_y - position_y[j];

            // Apply logical mask to skip the case 'i = j'
            const auto batch_mask = xsimd::abs(vec_next_target_indices - j) > local_tolerance;
            vec_contribution += xsimd::select(
              batch_mask, charge[j] / xsimd::sqrt(vec_diff_x * vec_diff_x + vec_diff_y * vec_diff_y), batch_type(0.));
        }

        vec_contribution.store_unaligned(&potential[i]);
    }

    value_type result{0.};

    // Loop through target data (remaining scalar part)
    for(std::size_t i = vec_size; i < size; ++i)
    {
        const value_type local_target_x = position_x[i], local_target_y = position_y[i];
        result = 0.;

        // Loop through source data
        for(std::size_t j = 0; j < size; ++j)
        {
            if(i != j)
            {
                const value_type diff_x = local_target_x - position_x[j], diff_y = local_target_y - position_y[j];
                result += charge[j] / std::sqrt(diff_x * diff_x + diff_y * diff_y);
            }
        }

        potential[i] = result;
    }
}

3.2.2. Implementation on GPU using CUDA

The CUDA kernel presented here was originally extracted from ScalFMM 2.0 and was rewritten to fit the current simplified context. Before moving on, we point the reader to the many brief yet comprehensive introductions to (old-fashioned) CUDA that are available for beginners.

  1. Useful macro for error handling

    For convenience, we define the following macro named CHECK_CUDA to handle errors in CUDA API calls.

    #pragma once
    
    #include "utils.hpp"
    
    #include <concepts>
    #include <functional>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <type_traits>
    
    #include <cuda_runtime.h>
    
    // Error handling macro
    #define CHECK_CUDA(call)                                                                                               \
        {                                                                                                                  \
            cudaError_t err = call;                                                                                        \
            if(err != cudaSuccess)                                                                                         \
            {                                                                                                              \
                throw std::runtime_error(std::string("CUDA error at ") + __FILE__ + ":" + std::to_string(__LINE__) +       \
                                         " - " + cudaGetErrorString(err));                                                 \
            }                                                                                                              \
        }
    
  2. CUDA kernel using shared memory

    Below is the code for the CUDA version of the simplified P2P inner kernel (Nyland, Harris, and Prins 2007). This implementation leverages shared memory to reduce global memory access and improve performance. Blocks of source data (positions and charges) are cached into shared memory to minimize repeated access to global memory. This is achieved using CUDA synchronization primitives like __syncthreads to make sure that all the required data is loaded into shared memory before proceeding with the computation. Additionally, the kernel takes advantage of CUDA's thread hierarchy: each thread is responsible for computing the potential for a specific target particle by interacting with all source particles.

    #pragma once
    
    #include "cuda-common.hpp"
    #include "utils.hpp"
    
    #include <cmath>
    #include <concepts>
    #include <functional>
    #include <iostream>
    #include <type_traits>
    
    #include <cuda_runtime.h>
    
    #define LOCAL_SHARED_MEMORY_SIZE 128
    
    /**
     * @brief CUDA version of the P2P inner kernel.
     *
     * This function calculates pairwise interactions (potential in this case) between particles based on
     * their positions and charges. Each thread processes a "target" particle, computes its interaction with all
     * "source" particles, and accumulates the result into the potential array. Shared memory is used to cache
     * blocks of source data (positions and charges) to reduce global memory access, enhancing performance.
     *
     * @tparam ValueType Floating-point type for the position and charge arrays (e.g., float or double).
     *
     * @param[in] position_x Global memory array holding the x-coordinates of the particles.
     * @param[in] position_y Global memory array holding the y-coordinates of the particles.
     * @param[in] charge Global memory array holding the charge of each particles.
     * @param[out] potential Global memory array where the computed potential for each particle is stored.
     * @param[in] size The total number of particles.
     */
    template<std::floating_point ValueType>
    __global__ auto p2p_inner_gpu_shared_kernel(const ValueType* const position_x, const ValueType* const position_y,
                                                const ValueType* const charge, ValueType* const potential,
                                                const std::size_t size) -> void
    {
        using value_type = ValueType;
    
        static constexpr std::size_t shared_mem_size = LOCAL_SHARED_MEMORY_SIZE;
        const std::size_t total_nb_threads = blockDim.x * gridDim.x;
        const std::size_t unique_thread_id = threadIdx.x + blockIdx.x * blockDim.x;
        const std::size_t total_nb_iterations = ((size + blockDim.x - 1) / blockDim.x) * blockDim.x;
    
        // loop through all the target data
        for(std::size_t target_idx = unique_thread_id; target_idx < total_nb_iterations; target_idx += total_nb_threads)
        {
            bool thread_compute = (target_idx < size);
    
            value_type target_position_x{}, target_position_y{}, target_potential{};
    
            if(thread_compute)
            {
                // load current target data from global memory to register
                target_position_x = position_x[target_idx];
                target_position_y = position_y[target_idx];
            }
            // loop through all the source data, processing in blocks cached to shared memory
            for(std::size_t shared_start_idx = 0; shared_start_idx < size; shared_start_idx += shared_mem_size)
            {
                __shared__ value_type shared_source_position_x[shared_mem_size];
                __shared__ value_type shared_source_position_y[shared_mem_size];
                __shared__ value_type shared_source_charge[shared_mem_size];
                // compute the number of entries to cache in shared memory
                const std::size_t nb_copies =
                  (shared_mem_size < (size - shared_start_idx)) ? shared_mem_size : (size - shared_start_idx);
    
                // load source data from global memory to shared memory
                for(std::size_t shared_local_idx = threadIdx.x; shared_local_idx < nb_copies;
                    shared_local_idx += blockDim.x)
                {
                    const std::size_t global_source_idx = shared_local_idx + shared_start_idx;
                    shared_source_position_x[shared_local_idx] = position_x[global_source_idx];
                    shared_source_position_y[shared_local_idx] = position_y[global_source_idx];
                    shared_source_charge[shared_local_idx] = charge[global_source_idx];
                }
    
                // synchronize all threads within the block
                __syncthreads();
    
                if(thread_compute)
                {
                    // skip the case 'target idx == source idx' in the summation
                    // note that 'source_idx = shared_start_idx + local_source_idx'
                    std::size_t left_copies = nb_copies;
                    if(shared_start_idx <= target_idx && target_idx < shared_start_idx + nb_copies)
                    {
                        left_copies = target_idx - shared_start_idx;
                    }
    
                    // left part: loop for 'source idx < target_idx'
                    for(std::size_t local_source_idx = 0; local_source_idx < left_copies; ++local_source_idx)
                    {
                        const value_type diff_x = shared_source_position_x[local_source_idx] - target_position_x;
                        const value_type diff_y = shared_source_position_y[local_source_idx] - target_position_y;
                        target_potential +=
                          shared_source_charge[local_source_idx] / std::sqrt(diff_x * diff_x + diff_y * diff_y);
                    }
                    // right part: loop for 'source idx > target_idx'
                    for(std::size_t local_source_idx = left_copies + 1; local_source_idx < nb_copies; ++local_source_idx)
                    {
                        const value_type diff_x = shared_source_position_x[local_source_idx] - target_position_x;
                        const value_type diff_y = shared_source_position_y[local_source_idx] - target_position_y;
                        target_potential +=
                          shared_source_charge[local_source_idx] / std::sqrt(diff_x * diff_x + diff_y * diff_y);
                    }
                }
    
                // synchronize all threads within the block
                __syncthreads();
            }
    
            // update outputs of the current target
            if(thread_compute)
            {
                potential[target_idx] = target_potential;
            }
    
            // synchronize all threads within the block
            __syncthreads();
        }
    }
    

    To run the previous kernel, all necessary data must be properly transferred to the device (GPU), and once the computation is complete, the required results are copied back to the host (CPU). This process is handled by the function shown below. We opted for "C-style" explicit memory copies, using low-level instructions like cudaMalloc and cudaMemcpy (see here). Another interesting alternative would have been to use the more modern high-level library Thrust. We could also have considered using CUDA Unified Memory, which simplifies memory management (a short sketch using unified memory is given after the launcher function below). It is worth noting that Kokkos supports both approaches: explicit transfers and unified memory. Finally, as is commonly done in CUDA, we run this kernel by setting up a grid of blocks of threads, with a tunable block size.

    /**
     * @brief Executes the CUDA version of the P2P inner kernel.
     *
     * This function sets up and manages the execution of the CUDA kernel `p2p_inner_gpu_shared_kernel` that computes
     * pairwise interactions (here the potential) between particles. The function transfers data from
     * the host (CPU) to the device (GPU), allocates device memory, runs the kernel, synchronizes the GPU, retrieves
     * the results, and measures the total elapsed time of the kernel execution.
     *
     * @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
     *         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
     *
     * @param[in] position_x Container of x-coordinates of the particles on the host.
     * @param[in] position_y Container of y-coordinates of the particles on the host.
     * @param[in] charge Container of charge values associated with each particle on the host.
     * @param[out] potential Output container where the computed potential for each particle is stored on the host.
     * @param[out] elapsed_time The elapsed time (in seconds) for the kernel execution.
     * @param[in] threads_per_block The number of threads per block to use when launching the CUDA kernel.
     */
    template<Container ContainerType>
    auto p2p_inner_gpu_shared(ContainerType const& position_x, ContainerType const& position_y, ContainerType const& charge,
                              ContainerType& potential, float& elapsed_time, int threads_per_block) -> void
    {
        using container_type = ContainerType;
        using value_type = typename container_type::value_type;
    
        const std::size_t size = position_x.size();
        value_type *d_position_x, *d_position_y, *d_charge, *d_potential;
        size_t mem_size = size * sizeof(value_type);
    
        // Create CUDA events for timestamps
        cudaEvent_t start, stop;
    
        CHECK_CUDA(cudaEventCreate(&start));
        CHECK_CUDA(cudaEventCreate(&stop));
    
        // Allocate memory on the device
        CHECK_CUDA(cudaMalloc(&d_position_x, mem_size));
        CHECK_CUDA(cudaMalloc(&d_position_y, mem_size));
        CHECK_CUDA(cudaMalloc(&d_charge, mem_size));
        CHECK_CUDA(cudaMalloc(&d_potential, mem_size));
    
        // Copy data from host to device
        CHECK_CUDA(cudaMemcpy(d_position_x, position_x.data(), mem_size, cudaMemcpyHostToDevice));
        CHECK_CUDA(cudaMemcpy(d_position_y, position_y.data(), mem_size, cudaMemcpyHostToDevice));
        CHECK_CUDA(cudaMemcpy(d_charge, charge.data(), mem_size, cudaMemcpyHostToDevice));
    
        // Define grid dimension
        const std::size_t blocks_per_grid = (size + threads_per_block - 1) / threads_per_block;
    
        // Run kernel function
        CHECK_CUDA(cudaEventRecord(start));
    
        p2p_inner_gpu_shared_kernel<<<blocks_per_grid, threads_per_block>>>(d_position_x, d_position_y, d_charge,
                                                                            d_potential, size);
        CHECK_CUDA(cudaGetLastError());
        CHECK_CUDA(cudaDeviceSynchronize());
    
        CHECK_CUDA(cudaEventRecord(stop));
        CHECK_CUDA(cudaEventSynchronize(stop));
        CHECK_CUDA(cudaEventElapsedTime(&elapsed_time, start, stop));
        elapsed_time /= 1.e3;   // conversion ms -> s
    
        // Copy the result matrix from device to host
        CHECK_CUDA(cudaMemcpy(potential.data(), d_potential, mem_size, cudaMemcpyDeviceToHost));
    
        // Free device memory
        CHECK_CUDA(cudaFree(d_position_x));
        CHECK_CUDA(cudaFree(d_position_y));
        CHECK_CUDA(cudaFree(d_charge));
        CHECK_CUDA(cudaFree(d_potential));
    
        // Destroy CUDA events
        CHECK_CUDA(cudaEventDestroy(start));
        CHECK_CUDA(cudaEventDestroy(stop));
    }
    
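    To illustrate the unified memory alternative mentioned above, here is a short sketch (our own, not part of the benchmarked code) in which cudaMallocManaged replaces the explicit cudaMalloc/cudaMemcpy pairs; the CUDA runtime then migrates pages between host and device on demand. The function name and the omitted initialization are purely illustrative.

    // Sketch only: unified memory variant of the data management around the same kernel.
    template<std::floating_point ValueType>
    auto p2p_inner_gpu_unified_sketch(std::size_t size, int threads_per_block) -> void
    {
        ValueType *position_x, *position_y, *charge, *potential;
        const std::size_t mem_size = size * sizeof(ValueType);

        // managed allocations are accessible from both the host and the device
        CHECK_CUDA(cudaMallocManaged(&position_x, mem_size));
        CHECK_CUDA(cudaMallocManaged(&position_y, mem_size));
        CHECK_CUDA(cudaMallocManaged(&charge, mem_size));
        CHECK_CUDA(cudaMallocManaged(&potential, mem_size));

        // ... fill position_x, position_y and charge directly on the host here ...

        const std::size_t blocks_per_grid = (size + threads_per_block - 1) / threads_per_block;
        p2p_inner_gpu_shared_kernel<<<blocks_per_grid, threads_per_block>>>(position_x, position_y, charge,
                                                                            potential, size);
        CHECK_CUDA(cudaGetLastError());
        CHECK_CUDA(cudaDeviceSynchronize());   // required before the host reads 'potential' again

        // ... use potential[] directly on the host here ...

        CHECK_CUDA(cudaFree(position_x));
        CHECK_CUDA(cudaFree(position_y));
        CHECK_CUDA(cudaFree(charge));
        CHECK_CUDA(cudaFree(potential));
    }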

3.3. Implementations using Kokkos

In this section, we present the implementations that were obtained using Kokkos. To give a general idea, Kokkos offers built-in instructions that can be used to parallelize common patterns. We focused particularly on the Kokkos::parallel_for and Kokkos::parallel_reduce instructions, which allow for quick parallelization of for loops. The Kokkos::parallel_for instruction is used to parallelize a for loop with independent iterations, while Kokkos::parallel_reduce is used to perform a reduction operation. We explored two approaches to parallelize the P2P inner kernel with Kokkos. The first involves flat parallelism, where only one of the two loops is parallelized. The second approach employs nested and hierarchical parallelism, allowing both loops to be parallelized simultaneously.
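As a quick, standalone illustration of the reduction pattern (not part of the P2P code), a dot product between two views could be written with Kokkos::parallel_reduce as follows; the partial sums computed by the different threads are automatically combined into result.

#include <Kokkos_Core.hpp>

// Sketch: reduce the dot product x.y into a single scalar.
inline double dot_product(Kokkos::View<const double*> x, Kokkos::View<const double*> y)
{
    double result = 0.;
    Kokkos::parallel_reduce(
      "dot product", x.extent(0),
      KOKKOS_LAMBDA(const int i, double& partial) { partial += x(i) * y(i); }, result);
    return result;
}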

3.3.1. Flat parallelism

  1. Using parallel for

    Let's dive into the code provided below. We begin by defining a set of aliases via the using keyword (this is completely optional, but it helps to simplify the code by avoiding dealing directly with lengthy type names). The memory space (where the data is stored) and the execution space (where the code runs) are determined at compile time, based on the chosen backend. This is controlled by a set of macros defined before the function, although Kokkos can often deduce the appropriate memory and execution spaces automatically, depending on the selected backend. Additionally, we can specify the memory layout (for example layout right or layout left, more details here), which is useful for handling multi-dimensional arrays; the most efficient layout depends on the backend in use.

    Now let's shift our focus to the "computation part" of the kernel. This implementation uses the Kokkos::parallel_for instruction to parallelize the outermost loop. Note the similarity with the traditional (and notorious) #pragma omp parallel for OpenMP directive. This instruction is used in combination with a 1D range policy that iterates over the target particle indices (passed to the Kokkos::parallel_for instruction right after the optional label). Note that the parallelized instructions (the body) are contained within a lambda function (roughly speaking, an anonymous function), although a functor could also be used (a class or struct that overloads operator(), allowing instances of that class to be called as if they were functions).

    Kokkos uses custom containers called views to manage memory on both the host and the device. Unlike standard containers, only views can be used within a Kokkos::parallel_for section. Data is populated into views on the host via mirrors, and if needed, a deep copy transfers the data from the host to the device. However, if the code is set to run directly on the host, no data transfer occurs. Thus, this implementation is portable: it can run on CPUs (with OpenMP or C++ threads as backends) as well as on GPUs (using CUDA as the backend, for example).

    #pragma once
    
    #include "utils.hpp"
    
    #include <Kokkos_Core.hpp>
    
    #include <concepts>
    #include <iostream>
    #include <type_traits>
    
    #ifdef KOKKOS_ENABLE_CUDA
    #define MemSpace Kokkos::CudaSpace
    #define ExecSpace Kokkos::Cuda
    #define Layout Kokkos::LayoutLeft
    #endif
    
    #ifdef KOKKOS_ENABLE_OPENMP
    #define MemSpace Kokkos::HostSpace
    #define ExecSpace Kokkos::OpenMP
    #define Layout Kokkos::LayoutRight
    #endif
    
    #ifdef KOKKOS_ENABLE_THREADS
    #define MemSpace Kokkos::HostSpace
    #define ExecSpace Kokkos::Threads
    #define Layout Kokkos::LayoutRight
    #endif
    
    #ifndef MemSpace
    #define MemSpace Kokkos::HostSpace
    #define ExecSpace Kokkos::DefaultHostExecutionSpace
    #define Layout Kokkos::LayoutRight
    #endif
    
    /**
     * @brief Portable P2P inner kernel using Kokkos parallelism (only the 'Kokkos::parallel_for' instruction).
     *
     * This function calculates pairwise interactions (potential in this case) between particles based on
     * their positions and charges using the 'Kokkos::parallel_for' instruction to parallelize the outermost
     * loop of the kernel. It enables parallel execution across different hardware architectures (e.g., CPU,
     * GPU), making it portable. It sets up device views, copies data between host and device,
     * runs the parallel computation, and retrieves the results back to the host (note that no data transfer
     * occurs when the code is executed directly on the host).
     *
     * @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
     *         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
     *
     * @param[in] position_x Container of x-coordinates of the particles on the host.
     * @param[in] position_y Container of y-coordinates of the particles on the host.
     * @param[in] charge Container of charge values associated with each particle on the host.
     * @param[out] potential Output container where the computed potential for each particle is stored on the host.
     * @param[out] elapsed_time The elapsed time (in seconds) for the parallel computation.
     *
    */
    template<Container ContainerType>
    auto p2p_inner_flat_parallel_for(ContainerType const& position_x, ContainerType const& position_y,
                                     ContainerType const& charge, ContainerType& potential, float& elapsed_time) -> void
    {
        // Set-up (aliases)
        using container_type = ContainerType;
        using value_type = typename container_type::value_type;
    
        using execution_space_type = ExecSpace;
        using memory_space_type = MemSpace;
        using layout_type = Layout;
    
        using policy_type = Kokkos::RangePolicy<execution_space_type>;
    
        using view_type = Kokkos::View<value_type*, layout_type, memory_space_type>;
        using host_mirror_type = typename view_type::HostMirror;
    
        using timer_type = Kokkos::Timer;
    
    #ifdef LOCAL_VERBOSITY
        // Print execution space details
        std::cout << std::string(40, '-') << '\n';
        execution_space_type ex{};
        std::cout << " - Execution Space = " << ex.name() << '\n';
    
        // Print memory space details
        memory_space_type mem{};
        std::cout << " - Memory Space = " << mem.name() << '\n';
        std::cout << std::string(40, '-') << '\n';
    #endif
    
        const std::size_t size = position_x.size();
    
        // Declaration of the views
        view_type view_position_x("view_position_x", size);
        view_type view_position_y("view_position_y", size);
        view_type view_charge("view_charge", size);
        view_type view_potential("view_potential", size);
    
        // Create host mirrors of device views
        host_mirror_type host_position_x = Kokkos::create_mirror_view(view_position_x);
        host_mirror_type host_position_y = Kokkos::create_mirror_view(view_position_y);
        host_mirror_type host_charge = Kokkos::create_mirror_view(view_charge);
        host_mirror_type host_potential = Kokkos::create_mirror_view(view_potential);
    
        // Initialize vectors on host (via mirrors)
        for(std::size_t i = 0; i < size; ++i)
        {
            host_position_x[i] = position_x[i];
            host_position_y[i] = position_y[i];
            host_charge[i] = charge[i];
        }
    
        // Deep copy from host mirrors to device views
        Kokkos::deep_copy(view_position_x, host_position_x);
        Kokkos::deep_copy(view_position_y, host_position_y);
        Kokkos::deep_copy(view_charge, host_charge);
    
        timer_type timer{};
    
        // Actual computation
        Kokkos::parallel_for(
          "Extern for", policy_type(0, size), KOKKOS_LAMBDA(std::size_t i) {
              const value_type local_x_target = view_position_x(i), local_y_target = view_position_y(i);
              value_type result{0.};
              for(std::size_t j = 0; j < i; ++j)
              {
                  const value_type diff_x = view_position_x(j) - local_x_target,
                                   diff_y = view_position_y(j) - local_y_target;
                  result += view_charge[j] / std::sqrt(diff_y * diff_y + diff_x * diff_x);
              }
              for(std::size_t j = i + 1; j < size; ++j)
              {
                  const value_type diff_x = view_position_x(j) - local_x_target,
                                   diff_y = view_position_y(j) - local_y_target;
                  result += view_charge[j] / std::sqrt(diff_y * diff_y + diff_x * diff_x);
              }
              view_potential(i) = result;
          });
    
        // We make sure that everything is completed before moving on...
        Kokkos::fence();
    
        elapsed_time = timer.seconds();
    
        // Deep copy device view to host view
        Kokkos::deep_copy(host_potential, view_potential);
    
        // Retrieve values
        for(std::size_t i = 0; i < size; ++i)
        {
            potential[i] = host_potential[i];
        }
    }
    
  2. Using parallel reduce

    We now present a second approach. Unlike the previous implementation, this version uses the Kokkos::parallel_reduce instruction to parallelize the innermost loop, performing a reduction on the target potential being updated. While this alternative produces correct numerical results, it is expected to have lower performance because only the inner loop is parallel while the outer loop over targets remains sequential (this is confirmed by the results shown in the results section). Similarly to the first approach, we use a 1D range policy to iterate over the source particle indices. Additionally, we still use views and mirrors to manage potential data transfers between the host and device as needed.

    #pragma once
    
    #include "utils.hpp"
    
    #include <Kokkos_Core.hpp>
    
    #include <concepts>
    #include <iostream>
    #include <type_traits>
    
    #ifdef KOKKOS_ENABLE_CUDA
    #define MemSpace Kokkos::CudaSpace
    #define ExecSpace Kokkos::Cuda
    #define Layout Kokkos::LayoutLeft
    #endif
    
    #ifdef KOKKOS_ENABLE_OPENMP
    #define MemSpace Kokkos::HostSpace
    #define ExecSpace Kokkos::OpenMP
    #define Layout Kokkos::LayoutRight
    #endif
    
    #ifdef KOKKOS_ENABLE_THREADS
    #define MemSpace Kokkos::HostSpace
    #define ExecSpace Kokkos::Threads
    #define Layout Kokkos::LayoutRight
    #endif
    
    #ifndef MemSpace
    #define MemSpace Kokkos::HostSpace
    #define ExecSpace Kokkos::DefaultHostExecutionSpace
    #define Layout Kokkos::LayoutRight
    #endif
    
    /**
     * @brief Portable P2P inner kernel using Kokkos parallelism (only the 'Kokkos::parallel_reduce' instruction).
     *
     * This function calculates pairwise interactions (potential in this case) between particles based on
     * their positions and charges using the 'Kokkos::parallel_reduce' instruction to parallelize the innermost
     * loop of the kernel. It enables parallel execution across different hardware architectures (e.g., CPU,
     * GPU), making it portable. It sets up device views, copies data between host and device,
     * runs the parallel computation, and retrieves the results back to the host (note that no data transfer
     * occurs when the code is executed directly on the host).
     *
     * @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
     *         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
     *
     * @param[in] position_x Container of x-coordinates of the particles on the host.
     * @param[in] position_y Container of y-coordinates of the particles on the host.
     * @param[in] charge Container of charge values associated with each particle on the host.
     * @param[out] potential Output container where the computed potential for each particle is stored on the host.
     * @param[out] elapsed_time The elapsed time (in seconds) for the parallel computation.
     *
     */
    template<Container ContainerType>
    auto p2p_inner_flat_parallel_reduce(ContainerType const& position_x, ContainerType const& position_y, ContainerType const& charge,
                                        ContainerType& potential, float& elapsed_time) -> void
    {
        // Set-up (aliases)
        using container_type = ContainerType;
        using value_type = typename container_type::value_type;
    
        using execution_space_type = ExecSpace;
        using memory_space_type = MemSpace;
        using layout_type = Layout;
    
        using policy_type = Kokkos::RangePolicy<execution_space_type>;
    
        using view_type = Kokkos::View<value_type*, layout_type, memory_space_type>;
        using host_mirror_type = typename view_type::HostMirror;
    
        using timer_type = Kokkos::Timer;
    
    #ifdef LOCAL_VERBOSITY
        // Print execution space details
        std::cout << std::string(40, '-') << '\n';
        execution_space_type ex{};
        std::cout << " - Execution Space = " << ex.name() << '\n';
    
        // Print memory space details
        memory_space_type mem{};
        std::cout << " - Memory Space = " << mem.name() << '\n';
        std::cout << std::string(40, '-') << '\n';
    #endif
    
        const std::size_t size = position_x.size();
    
        // Declaration of the views
        view_type view_position_x("view_position_x", size);
        view_type view_position_y("view_position_y", size);
        view_type view_charge("view_charge", size);
        view_type view_potential("view_potential", size);
    
        // Create host mirrors of device views
        host_mirror_type host_position_x = Kokkos::create_mirror_view(view_position_x);
        host_mirror_type host_position_y = Kokkos::create_mirror_view(view_position_y);
        host_mirror_type host_charge = Kokkos::create_mirror_view(view_charge);
        host_mirror_type host_potential = Kokkos::create_mirror_view(view_potential);
    
        // Initialize vectors on host (via mirrors)
        for(std::size_t i = 0; i < size; ++i)
        {
            host_position_x[i] = position_x[i];
            host_position_y[i] = position_y[i];
            host_charge[i] = charge[i];
        }
    
        // Deep copy host mirrors to device views
        Kokkos::deep_copy(view_position_x, host_position_x);
        Kokkos::deep_copy(view_position_y, host_position_y);
        Kokkos::deep_copy(view_charge, host_charge);
    
        timer_type timer{};
    
        // Actual computation
        for(std::size_t i = 0; i < size; ++i)
        {
            const value_type local_x_target = host_position_x(i), local_y_target = host_position_y(i);
    
            Kokkos::parallel_reduce(
              "Intern reduce", policy_type(0, size),
              KOKKOS_LAMBDA(std::size_t j, value_type& result) {
                  if(i != j)
                  {
                      const value_type diff_y = view_position_y(j) - local_y_target, diff_x = view_position_x(j) - local_x_target;
                      result += view_charge[j] / std::sqrt(diff_y * diff_y + diff_x * diff_x);
                  }
              },
              host_potential(i));
        }
    
        // We make sure that everything is completed before moving on...
        Kokkos::fence();
    
        elapsed_time = timer.seconds();
    
        // No deep copy needed here as host_potential was directly updated
        // Kokkos::deep_copy( host_potential, view_potential );
    
        // Retrieve values
        for(std::size_t i = 0; i < size; ++i)
        {
            potential[i] = host_potential[i];
        }
    }
    

3.3.2. Nested parallelism

The two previous approaches can be categorized as flat parallelism. However, this new version of the P2P inner kernel below uses nested parallelism, which involves multiple layers of parallel execution. In practice, we combine several Kokkos instructions, such as Kokkos::parallel_for and Kokkos::parallel_reduce, to achieve this (most of the time, we expect this to lead to better results than with simple flat parallelism). While the handling of views, mirrors, and deep copies remains unchanged from the previous implementations, the "computation part" differs as both loops are now parallelized. Instead of using a simple 1D range policy, we employ a team policy that distributes the workload across the two loops. This approach distributes the threads into teams that form a league (more details here), drawing an analogy to CUDA, where all the threads are organized in a grid made up of blocks.

#pragma once

#include "utils.hpp"

#include <Kokkos_Core.hpp>

#include <concepts>
#include <iostream>
#include <type_traits>

#ifdef KOKKOS_ENABLE_CUDA
#define MemSpace Kokkos::CudaSpace
#define ExecSpace Kokkos::Cuda
#define Layout Kokkos::LayoutLeft
#endif

#ifdef KOKKOS_ENABLE_OPENMP
#define MemSpace Kokkos::HostSpace
#define ExecSpace Kokkos::OpenMP
#define Layout Kokkos::LayoutRight
#endif

#ifdef KOKKOS_ENABLE_THREADS
#define MemSpace Kokkos::HostSpace
#define ExecSpace Kokkos::Threads
#define Layout Kokkos::LayoutRight
#endif

#ifndef MemSpace
#define MemSpace Kokkos::HostSpace
#define ExecSpace Kokkos::DefaultHostExecutionSpace
#define Layout Kokkos::LayoutRight
#endif

/**
* @brief Portable P2P inner kernel using Kokkos parallelism (both 'Kokkos::parallel_for' and 'Kokkos::parallel_reduce' instructions).
*
* This function calculates pairwise interactions (potential in this case) between particles based on
* their positions and charges using nested parallelism via Kokkos. It combines the 'Kokkos::parallel_for'
* instruction to parallelize the outermost loop and the 'Kokkos::parallel_reduce' instruction to parallelize
* the innermost loop.
*
* @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
*         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
*
* @param[in] position_x Container of x-coordinates of the particles on the host.
* @param[in] position_y Container of y-coordinates of the particles on the host.
* @param[in] charge Container of charge values associated with each particle on the host.
* @param[out] potential Output container where the computed potential for each particle is stored on the host.
* @param[out] elapsed_time The elapsed time (in seconds) for the parallel computation.
*/
template<Container ContainerType>
auto p2p_inner_nested_parallel(ContainerType const& position_x, ContainerType const& position_y,
                               ContainerType const& charge, ContainerType& potential, float& elapsed_time) -> void
{
    // Set-up (aliases)
    using container_type = ContainerType;
    using value_type = typename container_type::value_type;

    using execution_space_type = ExecSpace;
    using memory_space_type = MemSpace;
    using layout_type = Layout;

    using view_type = Kokkos::View<value_type*, layout_type, memory_space_type>;
    using host_mirror_type = typename view_type::HostMirror;

    using range_policy_type = Kokkos::TeamPolicy<execution_space_type>;
    using member_type = typename range_policy_type::member_type;

    using timer_type = Kokkos::Timer;

#ifdef LOCAL_VERBOSITY
    // Print execution space details
    std::cout << std::string(40, '-') << '\n';
    execution_space_type ex{};
    std::cout << " - Execution Space = " << ex.name() << '\n';

    // Print memory space details
    memory_space_type mem{};
    std::cout << " - Memory Space = " << mem.name() << '\n';
    std::cout << std::string(40, '-') << '\n';
#endif

    const std::size_t size = position_x.size();

    // Declaration of the views
    view_type view_position_x("view_position_x", size);
    view_type view_position_y("view_position_y", size);
    view_type view_charge("view_charge", size);
    view_type view_potential("view_potential", size);

    // Create host mirrors of device views
    host_mirror_type host_position_x = Kokkos::create_mirror_view(view_position_x);
    host_mirror_type host_position_y = Kokkos::create_mirror_view(view_position_y);
    host_mirror_type host_charge = Kokkos::create_mirror_view(view_charge);
    host_mirror_type host_potential = Kokkos::create_mirror_view(view_potential);

    // Initialize vectors on host (via mirrors)
    for(std::size_t i = 0; i < size; ++i)
    {
        host_position_x[i] = position_x[i];
        host_position_y[i] = position_y[i];
        host_charge[i] = charge[i];
    }

    // Deep copy host mirrors to device views
    Kokkos::deep_copy(view_position_x, host_position_x);
    Kokkos::deep_copy(view_position_y, host_position_y);
    Kokkos::deep_copy(view_charge, host_charge);

    timer_type timer{};

    // Actual computation
    Kokkos::parallel_for(
      "Extern for", range_policy_type(size, Kokkos::AUTO), KOKKOS_LAMBDA(const member_type& teamMember) {
          const std::size_t i = teamMember.league_rank();
          const value_type local_x_target = view_position_x(i), local_y_target = view_position_y(i);
          value_type result{0.};

          Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(teamMember, size),
            [&](std::size_t j, value_type& innerUpdate)
            {
                if(i != j)
                {
                    const value_type diff_y = view_position_y(j) - local_y_target, diff_x = view_position_x(j) - local_x_target;
                    innerUpdate += view_charge[j] / std::sqrt(diff_y * diff_y + diff_x * diff_x);
                }
            },
            result);

          Kokkos::single(Kokkos::PerTeam(teamMember), [&]() { view_potential(i) += result; });
      });

    // We make sure that everything is completed before moving on...
    Kokkos::fence();

    elapsed_time = timer.seconds();

    // Deep copy from device view to host mirror
    Kokkos::deep_copy(host_potential, view_potential);

    // Retrieve values
    for(std::size_t i = 0; i < size; ++i)
    {
        potential[i] = host_potential[i];
    }
}
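
For reference, a hypothetical driver calling this kernel could look as follows. This is only a sketch: we assume the header above is included, that std::vector<float> satisfies the Container concept declared in utils.hpp, and the particle data is arbitrary.

#include <Kokkos_Core.hpp>

#include <cstddef>
#include <iostream>
#include <vector>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        const std::size_t N = 1000;
        std::vector<float> x(N), y(N), q(N, 1.0f), phi(N, 0.0f);

        // Arbitrary particle distribution, for illustration only.
        for(std::size_t i = 0; i < N; ++i)
        {
            x[i] = static_cast<float>(i) / static_cast<float>(N);
            y[i] = static_cast<float>(N - i) / static_cast<float>(N);
        }

        float elapsed{};
        p2p_inner_nested_parallel(x, y, q, phi, elapsed);

        std::cout << "elapsed = " << elapsed << " s, phi[0] = " << phi[0] << '\n';
    }
    Kokkos::finalize();
    return 0;
}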

3.3.3. Hierarchical parallelism

Finally, an attempt was made to implement what is known in the Kokkos ecosystem as hierarchical parallelism. As in the nested approach, this version combines several constructs, such as Kokkos::parallel_for and Kokkos::parallel_reduce, and relies on a team policy rather than a simple 1D range policy to distribute the workload across several loops. As in the previous section, we use teams of threads arranged into a league, allowing for more structured parallel execution. The difference this time is that we use a memory space shared by the threads within a team, known as the team scratch pad memory, which is analogous to CUDA's shared memory. We use this scratch pad memory to cache data that can be reused during computations. Below is an example of how this feature can be used: we reorganized the computation to distribute the workload in a way that leverages the scratch pad memory. Although this version produced accurate numerical results, it did not yield a significant performance improvement over the previous implementations (see the result section), indicating that further investigation is needed; this remains a work in progress at the time of writing.
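
Before going through the complete kernel, here is a minimal, hypothetical sketch of the scratch-pad pattern in isolation. The function name, view names and sizes are ours; the relevant Kokkos calls are shmem_size, set_scratch_size, team_scratch and team_barrier.

#include <Kokkos_Core.hpp>

#include <cstddef>

// Minimal sketch of the team scratch-pad pattern (illustrative names and sizes).
void scratch_pad_sketch()
{
    using exec_space     = Kokkos::DefaultExecutionSpace;
    using team_policy_t  = Kokkos::TeamPolicy<exec_space>;
    using member_t       = team_policy_t::member_type;
    using scratch_view_t = Kokkos::View<double*, exec_space::scratch_memory_space>;

    const int nb_teams = 64, chunk = 256, level = 0;   // level 0 = fastest scratch level
    Kokkos::View<double*> data("data", nb_teams * chunk);

    // Request enough per-team scratch memory to cache one chunk.
    const std::size_t scratch_bytes = scratch_view_t::shmem_size(chunk);

    Kokkos::parallel_for(
      "scratch-pad-sketch",
      team_policy_t(nb_teams, Kokkos::AUTO).set_scratch_size(level, Kokkos::PerTeam(scratch_bytes)),
      KOKKOS_LAMBDA(const member_t& team) {
          const int e = team.league_rank();
          // One scratch allocation per team, shared by all of its threads.
          scratch_view_t cache(team.team_scratch(level), chunk);
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, chunk),
                               [&](int i) { cache(i) = data(e * chunk + i); });
          team.team_barrier();   // all values must be cached before they are reused
          // ... computations can now read cache(i) instead of global memory ...
      });
}

The complete kernel, which applies this pattern to the P2P computation, is listed below.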

#pragma once

#include "utils.hpp"

#include <Kokkos_Core.hpp>

#include <concepts>
#include <iostream>
#include <type_traits>

#ifdef KOKKOS_ENABLE_CUDA
#define MemSpace Kokkos::CudaSpace
#define ExecSpace Kokkos::Cuda
#define Layout Kokkos::LayoutLeft
#endif

#ifdef KOKKOS_ENABLE_OPENMP
#define MemSpace Kokkos::HostSpace
#define ExecSpace Kokkos::OpenMP
#define Layout Kokkos::LayoutRight
#endif

#ifdef KOKKOS_ENABLE_THREADS
#define MemSpace Kokkos::HostSpace
#define ExecSpace Kokkos::Threads
#define Layout Kokkos::LayoutRight
#endif

#ifndef MemSpace
#define MemSpace Kokkos::HostSpace
#define ExecSpace Kokkos::DefaultHostExecutionSpace
#define Layout Kokkos::LayoutRight
#endif

/**
* @brief Portable P2P inner kernel using Kokkos parallelism (hierarchical parallelism).
*
* This function calculates pairwise interactions (potential in this case) between particles based on
* their positions and charges using a hierarchical parallel execution model provided by the Kokkos library.
* The computation is divided into teams of threads that process chunks of data, making use of shared
* scratch memory to minimize global memory access and improve performance. The function handles data
* transfer between host and device, manages synchronization, and measures the total elapsed time
* for the computation.
*
* @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
*         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
*
* @param[in] position_x Container of x-coordinates of the particles on the host.
* @param[in] position_y Container of y-coordinates of the particles on the host.
* @param[in] charge Container of charge values associated with each particle on the host.
* @param[out] potential Output container where the computed potential for each particle is stored on the host.
* @param[in] chunk_size The size of each chunk to be processed by a team of threads.
* @param[in] vec_size The vector length (to use in the hierarchical parallel execution).
* @param[out] elapsed_time The elapsed time (in seconds) for the parallel computation.
*/
template<Container ContainerType>
auto p2p_inner_hierarchical_parallel(ContainerType const& position_x, ContainerType const& position_y,
                                     ContainerType const& charge, ContainerType& potential, std::size_t chunk_size,
                                     std::size_t vec_size, float& elapsed_time) -> void
{
    // Set-up (aliases)
    using container_type = ContainerType;
    using value_type = typename container_type::value_type;

    using execution_space_type = ExecSpace;
    using memory_space_type = MemSpace;
    using layout_type = Layout;

    using view_vector_type = Kokkos::View<value_type*, layout_type, memory_space_type>;
    using view_matrix_type = Kokkos::View<value_type**, layout_type, memory_space_type>;

    using host_mirror_type = typename view_vector_type::HostMirror;
    using host_mirror_matrix_type = typename view_matrix_type::HostMirror;

    using range_policy_type = Kokkos::TeamPolicy<execution_space_type>;
    using member_type = typename range_policy_type::member_type;

    using scratch_memory_space_type = typename execution_space_type::scratch_memory_space;
    using scratch_pad_view_type = Kokkos::View<value_type*, scratch_memory_space_type>;

    using timer_type = Kokkos::Timer;

#ifdef LOCAL_VERBOSITY
    // Print execution space details
    std::cout << std::string(40, '-') << '\n';
    execution_space_type ex{};
    std::cout << " - Execution Space = " << ex.name() << '\n';

    // Print memory space details
    memory_space_type mem{};
    std::cout << " - Memory Space = " << mem.name() << '\n';
    std::cout << std::string(40, '-') << '\n';
#endif

    const std::size_t size = position_x.size();
    const std::size_t nb_chunks = (size + chunk_size - 1) / chunk_size;

    // Declaration of the views
    view_matrix_type view_position_x("view_position_x", chunk_size, nb_chunks);
    view_matrix_type view_position_y("view_position_y", chunk_size, nb_chunks);
    view_matrix_type view_charge("view_charge", chunk_size, nb_chunks);
    view_matrix_type view_potential("view_potential", size, nb_chunks);

    // Create host mirrors of device views
    host_mirror_matrix_type host_position_x = Kokkos::create_mirror_view(view_position_x);
    host_mirror_matrix_type host_position_y = Kokkos::create_mirror_view(view_position_y);
    host_mirror_matrix_type host_charge = Kokkos::create_mirror_view(view_charge);
    host_mirror_matrix_type host_potential = Kokkos::create_mirror_view(view_potential);

    // Initialize vectors on host (via mirrors)
    std::size_t local_index{};
    for(std::size_t e = 0; e < nb_chunks; ++e)
    {
        for(std::size_t i = 0; i < chunk_size; ++i)
        {
            local_index = e * chunk_size + i;
            if(local_index < size)
            {
                host_position_x(i, e) = position_x[local_index];
                host_position_y(i, e) = position_y[local_index];
                host_charge(i, e) = charge[local_index];
            }
            else
            {
                // Zero-padding
                host_position_x(i, e) = 0.;
                host_position_y(i, e) = 0.;
                host_charge(i, e) = 0.;
            }
        }
    }

    // Deep copy from host mirrors to device views
    Kokkos::deep_copy(view_position_x, host_position_x);
    Kokkos::deep_copy(view_position_y, host_position_y);
    Kokkos::deep_copy(view_charge, host_charge);

    std::size_t scratch_mem_size{3 * scratch_pad_view_type::shmem_size(chunk_size)};

    timer_type timer{};

    // Actual computation
    int level{0};
    Kokkos::parallel_for(
      "Extern for",
      range_policy_type(nb_chunks, Kokkos::AUTO, vec_size).set_scratch_size(level, Kokkos::PerTeam(scratch_mem_size)),
      KOKKOS_LAMBDA(const member_type& team_member) {
          const std::size_t e = team_member.league_rank();

          // Cache source data into scratch pad memory
          scratch_pad_view_type shared_position_x(team_member.team_scratch(level), chunk_size);
          scratch_pad_view_type shared_position_y(team_member.team_scratch(level), chunk_size);
          scratch_pad_view_type shared_charge(team_member.team_scratch(level), chunk_size);
          if(team_member.team_rank() == 0)
          {
              Kokkos::parallel_for(Kokkos::ThreadVectorRange(team_member, chunk_size),
                                   [&](std::size_t i)
                                   {
                                       shared_position_x(i) = view_position_x(i, e);
                                       shared_position_y(i) = view_position_y(i, e);
                                       shared_charge(i) = view_charge(i, e);
                                   });
          }

          // Synchronization (to make sure that all the values are cached)
          team_member.team_barrier();

          Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, size),
                               [&](std::size_t j)
                               {
                                   const std::size_t local_i = j % chunk_size, local_e = j / chunk_size;
                                   const value_type local_x_target = view_position_x(local_i, local_e),
                                                    local_y_target = view_position_y(local_i, local_e);

                                   value_type local_result{0.};

                                   Kokkos::parallel_reduce(
                                     Kokkos::ThreadVectorRange(team_member, chunk_size),
                                     [&](std::size_t i, value_type& update_local_result)
                                     {
                                         if(j != e * chunk_size + i && e * chunk_size + i < size)
                                         {
                                             const value_type diff_y = shared_position_y(i) - local_y_target,
                                                              diff_x = shared_position_x(i) - local_x_target;
                                             update_local_result +=
                                               shared_charge(i) / std::sqrt(diff_y * diff_y + diff_x * diff_x);
                                         }
                                     },
                                     local_result);

                                   view_potential(j, e) = local_result;
                               });
      });

    // Final reduction (over all the chunks)
    Kokkos::parallel_for(
      "Extern for", range_policy_type(size, Kokkos::AUTO), KOKKOS_LAMBDA(const member_type& team_member) {
          const std::size_t j = team_member.league_rank();
          value_type result{0.};

          Kokkos::parallel_reduce(
            Kokkos::TeamVectorRange(team_member, nb_chunks),
            [&](std::size_t e, value_type& inner_update) { inner_update += view_potential(j, e); }, result);

          Kokkos::single(Kokkos::PerTeam(team_member), [&]() { view_potential(j, 0) = result; });
      });

    // We make sure that everything is completed before moving on...
    Kokkos::fence();

    elapsed_time = timer.seconds();

    // Deep copy device view to host mirror
    Kokkos::deep_copy(host_potential, view_potential);

    // Retrieve values
    for(std::size_t j = 0; j < size; ++j)
    {
        potential[j] = host_potential(j, 0);
    }
}
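
For reference, a hypothetical call site for this kernel is sketched below; chunk_size and vec_size are tuning parameters, and the values used here are arbitrary and would need to be adjusted for the target architecture.

#include <cstddef>
#include <vector>

// Hypothetical call site (assuming Kokkos is initialized and the header above is included).
void run_hierarchical_example()
{
    const std::size_t N = 65536;
    std::vector<float> x(N), y(N), q(N, 1.0f), phi(N, 0.0f);
    // ... fill x, y and q with particle data ...

    const std::size_t chunk_size = 256;   // number of sources cached per team in scratch memory
    const std::size_t vec_size   = 8;     // requested vector length of the team policy
    float elapsed{};

    p2p_inner_hierarchical_parallel(x, y, q, phi, chunk_size, vec_size, elapsed);
}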

3.3.4. Vectorization with Kokkos via Kokkos SIMD

Additionally, we implemented a version that uses the Kokkos SIMD API (still experimental at the time of writing), which provides tools similar to those found in xsimd, acting as a set of wrappers for SIMD intrinsics. Unlike the previous approaches, this version runs only on the CPU, but it remains portable across different CPU architectures, although it supports fewer instruction sets than xsimd. This feature of Kokkos allows us to write portable vectorized code specifically for the CPU. However, during the internship, we did not manage to develop a version that could run on both the GPU and the CPU with vectorization (again, this is still a work in progress). The versions presented in the previous sections can run on the GPU via CUDA and on the CPU via OpenMP, for example, but without explicit vectorization as provided by xsimd. Nonetheless, in the next section, we show that this implementation achieved performance comparable to that of xsimd (the code is also very similar).

 #pragma once

 #include "utils.hpp"

 #include <Kokkos_Core.hpp>
 #include <Kokkos_SIMD.hpp>

 #include <concepts>
 #include <iostream>
 #include <type_traits>

/**
* @brief P2P inner kernel using SIMD vectorization via Kokkos::simd API (still experimental
* at the time of writing).
*
* This function calculates pairwise interactions (potential in this case) between particles based
* on their positions and charges. It leverages SIMD vectorization via the SIMD API offered by Kokkos
* to process multiple particles simultaneously. The function uses SIMD to handle large portions of the
* data in parallel, while a scalar fallback handles any remaining particles that don't fit into a full
* SIMD batch. We use SIMD batch logical masks to remove self-interactions (acting as a filter).
*
* @tparam ContainerType The type of container used for the positions, charges, and potentials. This container must
*         support random access (e.g., `std::vector`, `std::array`) and store floating-point values.
*
* @param[in] position_x Container of x-coordinates of the particles.
* @param[in] position_y Container of y-coordinates of the particles.
* @param[in] charge Container of charge values associated with each particle.
* @param[out] potential Output container where the computed potential for each particle is stored.
*
*/
template<Container ContainerType>
 auto p2p_inner_kokkos_simd(ContainerType const& position_x, ContainerType const& position_y,
                            ContainerType const& charge, ContainerType& potential) -> void
 {
     // Set-up (aliases)
     using container_type = ContainerType;
     using value_type = typename container_type::value_type;
     using ksimd_type = Kokkos::Experimental::native_simd<value_type>;
     static constexpr std::size_t batch_size = ksimd_type::size();

     const std::size_t size = position_x.size();
     const std::size_t vec_size = size - size % batch_size;

     ksimd_type vec_local_x_target, vec_local_y_target, vec_contribution, vec_local_result;

     // Initialize the batch of target indices (used to build the logical mask) and the source-index batch
     ksimd_type vec_next_target_indices([](std::size_t i) { return i; }), vec_source_index;

     // Loop through target data (vectorized part)
     for(std::size_t i = 0; i < vec_size; i += batch_size, vec_next_target_indices += batch_size)
     {
         vec_local_x_target.copy_from(position_x.data() + i, Kokkos::Experimental::simd_flag_default);
         vec_local_y_target.copy_from(position_y.data() + i, Kokkos::Experimental::simd_flag_default);

         vec_contribution = 0.;

         // Loop through source data
         for(std::size_t j = 0; j < size; ++j)
         {
             const ksimd_type vec_diff_x = vec_local_x_target - position_x[j];
             const ksimd_type vec_diff_y = vec_local_y_target - position_y[j];

             // Apply logical mask to skip the case 'i = j'
             vec_source_index = j;
             vec_local_result = charge[j] / Kokkos::sqrt(vec_diff_x * vec_diff_x + vec_diff_y * vec_diff_y);
             where(vec_next_target_indices == vec_source_index, vec_local_result) = 0.0;

             vec_contribution += vec_local_result;
         }

         vec_contribution.copy_to(potential.data() + i, Kokkos::Experimental::simd_flag_default);
     }

     value_type result{0.};

     // Loop through target data (remaining scalar part)
     for(std::size_t i = vec_size; i < size; ++i)
     {
         const value_type local_target_x = position_x[i], local_target_y = position_y[i];
         result = 0.;

         // Loop through source data
         for(std::size_t j = 0; j < size; ++j)
         {
             if(i != j)
             {
                 const value_type diff_x = local_target_x - position_x[j], diff_y = local_target_y - position_y[j];
                 result += charge[j] / std::sqrt(diff_x * diff_x + diff_y * diff_y);
             }
         }

         potential[i] = result;
     }
 }
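
To isolate the masking idiom used above, here is a tiny standalone sketch (hypothetical example, not part of the internship code) that zeroes out one lane of a batch with a logical mask, exactly as the kernel does to filter out self-interactions.

#include <Kokkos_Core.hpp>
#include <Kokkos_SIMD.hpp>

#include <cstddef>
#include <iostream>

int main(int argc, char* argv[])
{
    Kokkos::ScopeGuard guard(argc, argv);

    using simd_t = Kokkos::Experimental::native_simd<double>;

    // Lanes 0, 1, 2, ... of the batch hold the values 0, 1, 2, ...
    simd_t indices([](std::size_t lane) { return lane; });
    simd_t values(1.0);   // broadcast: every lane holds 1.0

    // Zero out the lane whose index equals 2 (same filter as the self-interaction test).
    where(indices == simd_t(2.0), values) = 0.0;

    for(std::size_t lane = 0; lane < simd_t::size(); ++lane)
    {
        std::cout << values[lane] << ' ';
    }
    std::cout << '\n';
    return 0;
}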

4. Benchmarks and results

All the source code of the internship is available in this repository. We provide a set of executables to test the different versions presented above. For the executables that use Kokkos, it is important to note that they must be compiled with a Kokkos installation configured for the desired backend (e.g., CUDA, OpenMP, or C++ threads…). As briefly mentioned before, we used Guix to manage our development environments and ensure reproducibility. Since Kokkos configurations need to be set at compile time, several variants of the Kokkos package were introduced in Guix (e.g., kokkos-openmp, kokkos-cuda-a100, …). For reproducibility, we provide a dedicated channel file to leverage the guix time-machine command. Additionally, we created isolated development environments using the guix shell command combined with the --pure option, which unsets existing environment variables to prevent any "collision" with external packages. Although the --container option would provide a fully isolated environment, it is not currently supported on PlaFRIM because the running CentOS system ships an outdated Linux kernel, which prevents containerization. Consequently, we only relied on the --pure option.

We present various experiments that can be run across different backends. For each binary, a list of available options can be viewed using the --help option as follows:

./executable --help

4.1. On the CPU (using OpenMP)

First, we set up the environment via Guix:

guix time-machine -C guix-tools/channels.scm -- shell --pure -m guix-tools/manifest-openmp.scm -- bash --norc

We build all the executables at once with these CMake commands:

rm -rf build-non-cuda && cmake --preset release-non-cuda
cmake --build --preset build-non-cuda

The following command runs the naive (fully sequential) version introduced earlier to compute interactions between 1,000 randomly generated particles in single precision, with the computation repeated 5 times in a row.

build-non-cuda/bench-ref --number-runs 5 --number-of-particles 1000 --use-float --verbose

As recommended by the documentation of Kokkos, we can set the KOKKOS_PRINT_CONFIGURATION environment variable to enable verbose output and display the current configuration.

export KOKKOS_PRINT_CONFIGURATION=1

Since we use OpenMP as a backend, we can use OMP_PROC_BIND and OMP_PLACES to control how threads are assigned to CPU cores. If these variables are not set, Kokkos will display a warning message prompting you to configure them.

export OMP_PROC_BIND=spread
export OMP_PLACES=threads

We set OMP_PROC_BIND=spread to instruct the OpenMP runtime to distribute threads as evenly as possible across the available cores, and OMP_PLACES=threads to place threads on separate hardware threads (i.e., logical cores when hyper-threading is enabled).

Finally, as usual, we can set the number of threads using OMP_NUM_THREADS, though it is recommended to use KOKKOS_NUM_THREADS instead, as it also applies to the C++ threads backend.

export KOKKOS_NUM_THREADS=2

The different versions can be run with the following commands, where the --check option compares the final numerical result with the reference result (baseline version).

build-non-cuda/bench-kokkos-flat-parallel-for --number-runs 5 --number-of-particles 1000 --use-float --verbose --check
build-non-cuda/bench-kokkos-flat-parallel-reduce --number-runs 5 --number-of-particles 1000 --use-float --verbose --check
build-non-cuda/bench-kokkos-nested-parallel --number-runs 5 --number-of-particles 1000 --use-float --verbose --check
build-non-cuda/bench-kokkos-hierarchical --number-runs 5 --number-of-particles 1000 --use-float --verbose --check

4.2. On the CPU (for vectorization)

For optimal vectorization, Guix allows packages to be marked as tunable, which is essential to fully take advantage of SIMD capabilities (as explained in the documentation). When using the --tune[=arch] option (typically --tune=native), binaries are compiled with the -march=arch flag (see x86 Options for examples), enabling them to specifically target the desired CPU architecture. Prior to the internship, the Kokkos packages available in Guix were not marked as tunable, so we created a local variant (defined in kokkos-variants.scm) to add this property, as described in the Guix reference manual. Once the original package is updated (which is expected soon), this custom variant will no longer be necessary and the --tune[=arch] option will be usable directly. For example, for the Intel Xeon Skylake Gold 6240 processors of the Bora nodes on PlaFRIM (which support the AVX-512 instruction set), we provide a suitable manifest file that can be used as follows:

guix time-machine -C guix-tools/channels.scm -- shell --pure -m guix-tools/manifest-cpu-skylake.scm -L guix-tools/ -- bash --norc
rm -rf build-non-cuda && cmake --preset release-non-cuda
cmake --build --preset build-non-cuda

Once everything is set up, we can run the program using either xsimd:

build-non-cuda/bench-xsimd --number-runs 5 --number-of-particles 1000 --use-float --verbose --check

or Kokkos SIMD:

build-non-cuda/bench-kokkos-simd --number-runs 5 --number-of-particles 1000 --use-float --verbose --check

4.3. On the GPU (using CUDA)

Similarly, we can perform experiments using CUDA as a backend. For instance, on a Sirocco node of PlaFRIM equipped with NVIDIA A100 GPUs:

guix time-machine -C guix-tools/channels.scm -- shell --pure -m guix-tools/manifest-a100.scm -- env LD_PRELOAD=/usr/lib64/libcuda.so:/usr/lib64/libnvidia-ptxjitcompiler.so bash --norc
rm -rf build-cuda && cmake --preset release-cuda
cmake --build --preset build-cuda

We can start by running the baseline CUDA version:

build-cuda/bench-cuda-shared --number-runs 5 --number-of-particles 1000 --use-float --verbose --check --threads-per-block 256

Likewise, for convenience, we set the KOKKOS_PRINT_CONFIGURATION variable to enable verbose output and display the current configuration.

export KOKKOS_PRINT_CONFIGURATION=1

As in the previous cases, the executables can be run as follows:

build-cuda/bench-kokkos-flat-parallel-for --number-runs 5 --number-of-particles 1000 --use-float --verbose --check
build-cuda/bench-kokkos-flat-parallel-reduce --number-runs 5 --number-of-particles 1000 --use-float --verbose --check
build-cuda/bench-kokkos-nested-parallel --number-runs 5 --number-of-particles 1000 --use-float --verbose --check
build-cuda/bench-kokkos-hierarchical --number-runs 5 --number-of-particles 1000 --use-float --verbose --check

4.4. Run all the experiments

Finally, we provide batch scripts to run all the experiments directly on PlaFRIM. This can be achieved using the following commands:

sbatch scripts/experiments-batch-bora-openmp.batch
sbatch scripts/experiments-batch-bora-threads.batch
sbatch scripts/experiments-batch-a100.batch
sbatch scripts/experiments-batch-v100.batch
sbatch scripts/experiments-batch-p100.batch
sbatch scripts/experiments-batch-broadwell.batch
sbatch scripts/experiments-batch-skylake.batch

4.5. Plots

After running all the experiments from the previous section, we provide a Python script plot-results.py to plot the results.

4.5.1. Comparison: xsimd vs Kokkos SIMD

mkdir -p img
python3 scripts/plot-results.py --data-files \
results/bench-xsimd-f64-sirocco15.plafrim.cluster-experiment-skylake.txt \
results/bench-kokkos-simd-f64-sirocco15.plafrim.cluster-experiment-skylake.txt \
results/bench-xsimd-f32-sirocco15.plafrim.cluster-experiment-skylake.txt \
results/bench-kokkos-simd-f32-sirocco15.plafrim.cluster-experiment-skylake.txt \
--colors "yellowgreen" "violet" "lightgreen" "slateblue" \
--labels "xsimd (fp 64)" "kokkos simd (fp 64)" "xsimd (fp 32)" "kokkos simd (fp 32)" \
--linewidths 2 3 2 3 \
--marker d d s s \
--linestyles "-" ":" "-" ":" \
--title $'Comparison Kokkos::simd and xsimd\nIntel Xeon Skylake Gold 6240 @ 2.6 GHz [AVX512]' \
--xlabel "size (N)" \
--ylabel "time (s)" \
--xscale log \
--yscale log \
--plot-file "img/plot-simd-skylake.png"


Figure 5: Comparison between xsimd and Kokkos SIMD (Intel Xeon Skylake Gold 6240)

4.5.2. Scalability (with OpenMP threads as a backend)

mkdir -p img
python3 scripts/plot-results.py --data-files \
results/bench-kokkos-flat-parallel-for-openmp-time-vs-num-threads-f32-bora012.plafrim.cluster.txt \
results/bench-kokkos-flat-parallel-reduce-openmp-time-vs-num-threads-f32-bora012.plafrim.cluster.txt \
results/bench-kokkos-nested-parallel-openmp-time-vs-num-threads-f32-bora012.plafrim.cluster.txt \
results/bench-kokkos-hierarchical-openmp-time-vs-num-threads-f32-bora012.plafrim.cluster.txt \
--colors "coral" "deepskyblue" "forestgreen" "darkgoldenrod" \
--labels "Flat version (parallel_for)" "Flat version (parallel_reduce)" "Nested version" "Hierarchical version" \
--linewidths 1 1 1 1 \
--marker d s v o \
--linestyles "-" "-" "-" "-" \
--title $'Scalability with Kokkos backend = OpenMP (N=65,536)\n(2x18 core Intel Xeon Skylake Gold 6240 @ 2.6 GHz)' \
--xlabel "Number of threads" \
--ylabel "Time (s)" \
--xscale linear \
--yscale log \
--x-data 0 \
--y-data 1 \
--plot-file "img/plot-openmp-time-vs-num-threads.png"


Figure 6: Strong scalability results with OpenMP (Intel Xeon Skylake Gold 6240)

4.5.3. Scalability (with C++ threads as a backend)

mkdir -p img
python3 scripts/plot-results.py --data-files \
results/bench-kokkos-flat-parallel-for-threads-time-vs-num-threads-f32-bora013.plafrim.cluster.txt \
results/bench-kokkos-flat-parallel-reduce-threads-time-vs-num-threads-f32-bora013.plafrim.cluster.txt \
results/bench-kokkos-nested-parallel-threads-time-vs-num-threads-f32-bora013.plafrim.cluster.txt \
results/bench-kokkos-hierarchical-threads-time-vs-num-threads-f32-bora013.plafrim.cluster.txt \
--colors "coral" "deepskyblue" "forestgreen" "darkgoldenrod" \
--labels "Flat version (parallel_for)" "Flat version (parallel_reduce)" "Nested version" "Hierarchical version" \
--linewidths 1 1 1 1 \
--marker d s v o \
--linestyles "-" "-" "-" "-" \
--title $'Scalability with Kokkos backend = Threads (N=65,536)\n(2x18 core Intel Xeon Skylake Gold 6240 @ 2.6 GHz)' \
--xlabel "Number of threads" \
--ylabel "Time (s)" \
--xscale linear \
--yscale log \
--x-data 0 \
--y-data 1 \
--plot-file "img/plot-threads-time-vs-num-threads.png"


Figure 7: Strong scalability results with C++ threads (Intel Xeon Skylake Gold 6240)

4.5.4. Comparison of the strategies on the GPU (NVIDIA)

mkdir -p img
python3 scripts/plot-results.py --data-files \
results/bench-kokkos-flat-parallel-for-cuda-f32-sirocco22.plafrim.cluster-experiment-a100.txt \
results/bench-kokkos-flat-parallel-reduce-cuda-f32-sirocco22.plafrim.cluster-experiment-a100.txt \
results/bench-kokkos-nested-parallel-cuda-f32-sirocco22.plafrim.cluster-experiment-a100.txt \
results/bench-kokkos-hierarchical-cuda-f32-sirocco22.plafrim.cluster-experiment-a100.txt \
results/bench-cuda-shared-f32-sirocco22.plafrim.cluster-experiment-a100.txt \
--colors "coral" "deepskyblue" "seagreen" "cyan" "red" \
--labels "Flat version (parallel_for)" "Flat version (parallel_reduce)" "Nested version" "Hierarchical version" "CUDA (reference)"\
--linewidths 2 2 2 2 1 \
--marker d s o v s\
--linestyles "-" "-" "-" "-" ":" \
--title $'Comparison of the different approaches (GPU: NVIDIA A100) [f32]' \
--xlabel "size (N)" \
--ylabel "time (s)" \
--xscale log \
--yscale log \
--plot-file "img/plot-kokkos-comparison-cuda-f32-time.png"


Figure 8: Comparison on GPU (NVIDIA A100)

4.5.5. Observations

In Figure 5, we can see that xsimd and Kokkos SIMD achieve comparable performance on this specific kernel. The final comparison in Figure 8 indicates that the different implementations yielded results similar to those of the CUDA version; with the exception of the version that uses the Kokkos::parallel_reduce instruction, all versions delivered similar performance. Given the use of the scratch pad memory, we might have expected the hierarchical Kokkos version to perform significantly better than the flat versions. This discrepancy suggests a need for further investigation, which is currently ongoing. Additionally, it would be valuable to repeat these experiments using a more complex interaction function resulting in a higher computational workload, which is also a work in progress.

5. Closing remarks

This work initially aimed to explore what Kokkos offers for writing portable code. During the project, we successfully implemented several variants of the P2P inner kernel and obtained encouraging performance results. However, while the hierarchical version was expected to outperform the flat versions, this was not exactly the case, and further investigation is therefore needed to develop a more effective hierarchical approach. Moving forward, the goal is now to extend the work done for the basic electrostatic kernel to more generic kernels, allowing us to handle a wider range of interactions, such as:

\begin{equation*} k(x,y) = \frac{(x - y)}{\|x - y\|^3}, \end{equation*} \begin{equation*} k(x,y) = \ln{\bigl\{\| x - y\|\bigr\}}, \end{equation*} \begin{equation*} k(x,y) = \exp{\bigl\{\frac{- \| x - y \|}{ \sigma^2 }\bigr\}}, \end{equation*} \begin{equation*} k(x,y) = 3 \frac{(x - y) \otimes (x - y)}{\|x - y\|^5} - \frac{1}{\|x - y\|} \textbf{I}_{3 \times 3}. \end{equation*}

Such work would also provide an opportunity to explore the different possibilities offered by Kokkos views. Additionally, it is worth noting that we only tested Kokkos SIMD with a basic kernel built from common operations, and it performed comparably to state-of-the-art libraries like xsimd. It would be worthwhile to repeat these experiments with more complex kernels (involving functions such as logarithms or exponentials, which lack pre-implemented SIMD intrinsics) to better assess the extent of support compared to other leading SIMD libraries. A future goal, as originally planned, would be to integrate Kokkos-based kernels into a program that uses StarPU, to explore how these components can be effectively combined.
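
As an illustration of what such generic kernels could look like, here is a hypothetical sketch (not code written during the internship) in which the interaction function k is simply passed as a callable to a sequential P2P loop; the names p2p_inner_generic and log_kernel are ours.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a sequential P2P loop parameterized by the interaction function k.
template<typename ValueType, typename KernelFunction>
void p2p_inner_generic(std::vector<ValueType> const& position_x, std::vector<ValueType> const& position_y,
                       std::vector<ValueType> const& input, std::vector<ValueType>& output, KernelFunction&& k)
{
    const std::size_t size = position_x.size();
    for(std::size_t i = 0; i < size; ++i)
    {
        ValueType result{0.};
        for(std::size_t j = 0; j < size; ++j)
        {
            if(i != j)
            {
                result += k(position_x[i], position_y[i], position_x[j], position_y[j]) * input[j];
            }
        }
        output[i] = result;
    }
}

// Example: the 2D logarithmic interaction k(x, y) = ln(||x - y||).
// auto log_kernel = [](double xt, double yt, double xs, double ys)
// { return std::log(std::sqrt((xt - xs) * (xt - xs) + (yt - ys) * (yt - ys))); };
// p2p_inner_generic(px, py, charges, potentials, log_kernel);

The tensor-valued kernels above would additionally require multidimensional outputs, which is where the flexibility of Kokkos views is expected to help.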

6. References

Blanchard, Pierre. 2017. “Fast Hierarchical Algorithms for the Low-Rank Approximation of Matrices, with Applications to Materials Physics, Geostatistics and Data Analysis.” Université de Bordeaux.
Edwards, H. Carter, Christian R. Trott, and Daniel Sunderland. 2014. “Kokkos: Enabling Manycore Performance Portability through Polymorphic Memory Access Patterns.” Journal of Parallel and Distributed Computing 74 (12): 3202–16. https://doi.org/10.1016/j.jpdc.2014.07.003.
Greengard, Leslie, and Vladimir Rokhlin. 1987. “A Fast Algorithm for Particle Simulations.” Journal of Computational Physics 73 (2): 325–48.
Nyland, Lars, Mark Harris, and Jan Prins. 2007. “Fast N-Body Simulation with CUDA.” GPU Gems 3: 62–66.
Trott, Christian R., Damien Lebrun-Grandié, Daniel Arndt, Jan Ciesko, Vinh Dang, Nathan Ellingwood, Rahulkumar Gayatri, et al. 2022. “Kokkos 3: Programming Model Extensions for the Exascale Era.” IEEE Transactions on Parallel and Distributed Systems 33 (4): 805–17. https://doi.org/10.1109/TPDS.2021.3097283.

Date: 07/11/2024 at 10:20:05

Author: Aurélien Gauthier, Antoine Gicquel

