TTG

This is the C++ API for the Template Task Graph (TTG) programming model for flowgraph-based composition of high-performance algorithms executable on distributed heterogeneous computer platforms. The TTG API abstracts out the details of the underlying task and data flow runtime; the current realization is implemented using MADNESS and PaRSEC runtimes as backends.

Why TTG?

TTG might be for you if you want fine-grained parallel execution of complex (especially, data-dependent) algorithms on distributed-memory heterogeneous machines, for these reasons:

programming models that target fine-grained parallelism, like native language tools (threads, async) and programming models/libraries (OpenMP, TaskFlow, Cilk, etc.) deal only with control flow, and thus are poorly suited for dealing with data-dependent execution
such models do not deal with distributed memory anyway
and specialized runtimes like StarPU, PaRSEC, MADNESS, HPX, UPC++, etc., are still relatively low-level abstractions for expressing complex data-dependent task flows across modern distributed heterogeneous machines.

The development of TTG was motivated by irregular scientific applications like adaptive multiresolution numerical calculus and data-sparse tensor algebra which have lacked tools to keep up with the evolution of HPC platforms, especially toward heterogeneity. But TTG is far more widely applicable than that; it is a general-purpose programming model.

Installation

To try out TTG in a Docker container, install Docker, then execute bin/docker-build.sh and follow the instructions in bin/docker.md.
See INSTALL.md to learn how to build and install TTG.

A Short Intro to TTG

TL;DR: A "Hello, World" TTG Program

helloworld.cpp

#include <ttg.h>
 
int main(int argc, char *argv[]) {
  ttg::initialize(argc, argv);
  // a simple template task (TT)
  auto tt = ttg::make_tt([]() { std::cout << "Hello, World!\n"; });
 
  // single TT is also a TT graph (=TTG); signal that we are done constructing the TTG
  ttg::make_graph_executable(tt);
  // start executing any available tasks
  ttg::execute();
  // create task to kickstart computation
  if (ttg::get_default_world().rank() == 0) tt->invoke();
  // wait for completion
  ttg::fence();
 
  ttg::finalize();
  return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.19)
project(TTG-HW CXX)
 
find_package(ttg QUIET) # check if TTG is already available
if (NOT TARGET ttg-parsec) # else build from source
  include(FetchContent)
  FetchContent_Declare(ttg GIT_REPOSITORY https://github.com/TESSEorg/ttg.git)
  FetchContent_MakeAvailable( ttg )
endif()
 
add_executable(helloworld-parsec helloworld.cpp)
target_link_libraries(helloworld-parsec PRIVATE ttg-parsec)
target_compile_definitions(helloworld-parsec PRIVATE TTG_USE_PARSEC=1)

Configure + build:

> cmake -S . -B build && cmake --build build --target helloworld-parsec

The complete example, including the CMake build harness using a slightly easier way to build the executable (using add_ttg_executable CMake macro), can be found in dox examples.

"Hello, World!" Walkthrough

Although it does not involve any useful flow of computation and/or data, the above "Hello, World!" TTG program introduces several key TTG concepts and illustrates what you need to do to write a complete TTG program. So let's walk through it.

Programming Model

The basic model of computation is built around a Template Task Graph (TTG). A TTG consists of one or more connected Template Task (TT) objects. Each message that travels between TTs consists of a (potentially void) task ID and an (optional) datum. A TT creates a task for a given task ID when all of its input terminals have received a message with that task ID. The task body can send data to zero or more of the output terminals defined for the corresponding TT.

Thus, task creation is a byproduct of messages traveling through one or more TTGs. What makes the model powerful is the ability to encode large DAGs of tasks compactly.

Before proceeding further, let's refine the few concepts used to define the programming model above:

TaskId (aka Key): A unique identifier for each task. It must be perfectly hashable.
Terminal: A port for receiving (input) and sending (output) messages. Each message consists of a (potentially void) TaskId and an (optional) datum. Terminals are strongly typed. An {in,out}put terminal can be connected to one or more {out,in}put terminals (as long as the TaskId and datum types match). Input terminals are programmable (e.g., incoming messages can be optionally reduced).
TemplateTask (aka TT): This is a template for creating tasks. The task template creates a task associated with a given TaskId when every input terminal has received a message for the given TaskId.
Edge: A connection between an input terminal and an output terminal. An Edge denotes a 1-to-1 connection and exists to be able to think of TTGs as graphs ("data flows between TTs' terminals via Edges"); do not confuse with the TTG C++ class Edge which behaves like a hyperedge by composing 1-to-many and many-to-1 connections between terminals.
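The firing rule described above (a task is created for a TaskId once every input terminal has received a message carrying that TaskId) can be sketched in plain C++. This is a toy model for illustration only: ToyTT2, send0, send1, and num_pending are hypothetical names, and nothing here reflects how TTG or its backends are actually implemented.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <unordered_map>
#include <utility>

// Toy model of a template task with two input terminals (NOT the TTG API).
// Messages are matched by key; the body runs once both inputs for a key arrived.
template <typename Key, typename A, typename B>
class ToyTT2 {
 public:
  using Body = std::function<void(const Key&, const A&, const B&)>;
  explicit ToyTT2(Body body) : body_(std::move(body)) {}

  // Deliver a message (key, datum) to input terminal 0 or 1.
  void send0(const Key& k, A a) { pending_[k].first = std::move(a); try_fire(k); }
  void send1(const Key& k, B b) { pending_[k].second = std::move(b); try_fire(k); }

  // Number of keys that are still waiting for at least one input.
  std::size_t num_pending() const { return pending_.size(); }

 private:
  void try_fire(const Key& k) {
    auto it = pending_.find(k);
    if (it == pending_.end() || !it->second.first || !it->second.second) return;
    // Remove the entry before running the body so the body may send new messages.
    auto args = std::move(it->second);
    pending_.erase(it);
    body_(k, *args.first, *args.second);
  }

  Body body_;
  std::unordered_map<Key, std::pair<std::optional<A>, std::optional<B>>> pending_;
};
```

In this sketch the key type must be usable with std::hash, which mirrors the requirement that a TaskId be hashable.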

Due to its simplicity only one template task appears in the "Hello, World!" program.

Structure of a Minimal TTG Program

Every TTG program must:

select the TTG backend,
initialize the TTG runtime,
construct a TTG by declaring its constituent nodes,
make TTG executable and kickstart the execution by sending a control or data message to the TTG,
shut down the runtime

Let's go over each of these steps using the "Hello, World!" example. The complete example, including the CMake build harness, can be found in dox examples.

Select the TTG Backend

The TTG C++ implementation is currently supported by 2 backends providing task scheduling, data transfer, and resource management. While it is possible to use a specific TTG backend explicitly, by using the appropriate namespaces, it is recommended to write backend-neutral programs that can be specialized to a particular backend in one of two ways.

By defining one (and only one) of the following macros, via the command-line argument to the compiler (recommended) or as an explicit #define statement in the source code:
- TTG_USE_PARSEC: selects the PaRSEC backend as the default;
- TTG_USE_MADNESS: selects the MADNESS backend as the default (expert-use only).
  
  Following the definition of this macro it is safe to include the top-level TTG header file:
  
  #include <ttg.h>
By including the corresponding backend-specific header directly:
- to use PaRSEC backend only, add:
  
  #include <ttg/parsec/ttg.h>
- to use the MADNESS backend only, add:
  
  #include <ttg/madness/ttg.h>
This approach does not require inclusion of the top-level TTG header or definition of a backend selection macro.

Initialize

To initialize the TTG runtime invoke ttg::initialize(argc, argv). There are several overloads of this function that also accept other optional parameters, such as the number of threads in the main thread pool, the MPI communicator for execution, etc.

Create a TTG

To make a TTG, create and connect one or more TTs. The simplest TTG consists of a single TT as in the "Hello, World!" example.

Since the TT in the "Hello, World!" example generates a single task, the task ID can be omitted (its type is void). This task also does not take or produce any data. The easiest way to make such a TT is by wrapping a callable (e.g., a lambda) with ttg::make_tt:

auto tt = ttg::make_tt([]() { std::cout << "Hello, World!\n"; });

Execute TTG

To execute a TTG we must make it executable (this will declare the TTG program complete so no additional changes to the flowgraph are possible). To execute the TTG its root TT must receive at least one message; since in this case the task does not receive either task ID or data, the message is empty (i.e., void):

ttg::make_graph_executable(tt);
ttg::execute();
if (ttg::get_default_world().rank() == 0)
    tt->invoke();

ttg::execute() starts the TTG work engines and must occur before, not after, kickstarting any computation by sending data into the free in-terminals. The kickstart messages are generated by invoking TT::invoke(taskID, values); in this case the task ID is void, and the task does not take any input data, so TT::invoke() receives no parameters. Since TTG uses the Single Program Multiple Data (SPMD) execution model, every process invokes ttg::execute() when this TTG program is run as multiple processes, but only the first process (rank 0) gets to send the kickstart message since all messages in a TTG program, including the kickstart messages, must be unique.

Finalize TTG

Since a TTG program is executed asynchronously, we must ensure that all tasks are finished:

ttg::fence();

Before exiting main() the TTG runtime should be finalized:

ttg::finalize();

Beyond "Hello, World!"

Since "Hello, World!" consists of a single task it does not demonstrate either how to control scheduling of multiple tasks or enable data flow between tasks. Let's use the computation of the Nth Fibonacci number as a simple example of a recursive task-based computation that is often used (OpenMP, TBB, Legion, Cilk) to illustrate basic features of task-based programming models. Although the example lacks opportunity for parallelism, the point here is not performance but its simplicity.

Example: <tt>N</tt>th Fibonacci Number

This example illustrates how to compute a particular element of the Fibonacci sequence defined by recurrence $F_N = F_{N-1} + F_{N-2}, F_0=0, F_1=1$.

nth-fibonacci.cpp

#include <ttg.h>
 
int main(int argc, char *argv[]) {
  ttg::initialize(argc, argv);
 
  const int64_t N = 20; // want to compute fib(20)
  // edges used for recursion
  ttg::Edge<int64_t, int64_t> f2f_nm1, f2f_nm2;
  // edge to the task printing the output
  ttg::Edge<void, int64_t> f2p;
  auto fib = ttg::make_tt(
      [=](int64_t n, int64_t F_nm1, int64_t F_nm2) {
        auto F_n = F_nm1 + F_nm2;
        if (n < N) {
          // recursion: send result to first input and
          //            prior result to second input of Fib(n+1)
          ttg::send<0>(n + 1, F_n);
          ttg::send<1>(n + 1, F_nm1);
        } else {
          // send to print task below
          ttg::sendv<2>(F_n);
        }
      },
      // input edges: first input, second input
      ttg::edges(f2f_nm1, f2f_nm2),
      // output edges: to first input, to second input, to print task
      ttg::edges(f2f_nm1, f2f_nm2, f2p),
      // name of the task
      "fib");
  auto print = ttg::make_tt([](int64_t F_N) { std::cout << N << "th Fibonacci number is " << F_N << std::endl; },
                            // input from fib
                            ttg::edges(f2p),
                            // no outputs
                            ttg::edges(),
                            "print");
 
  ttg::make_graph_executable(fib);
  ttg::execute();
  if (ttg::get_default_world().rank() == 0) fib->invoke(2, std::make_tuple(1, 0));
  ttg::fence();
 
  ttg::finalize();
  return 0;
}

The TTG consists of 2 TTs, one (fib) that implements the Fibonacci recurrence and another (print) that prints the result to std::cout:

fib computes $F_{n}$ from $F_{n-1}$ and $F_{n-2}$ and either sends $F_{n}$ and $F_{n-1}$ to the next ($n+1$) instance of fib, or, if $n=N$, sends $F_{n}$ to print. Thus fib needs 2 input terminals and 3 output terminals (for better efficiency instead of sending individual Fibonacci numbers, each over an individual edge, it is better to send a pair of Fibonacci numbers over a single edge).
print receives a single unannotated datum and produces no data, so it needs a single input terminal and no output terminals.

Execution of the program starts by explicitly instantiating fib for $n=2$. In total 20 tasks will be executed: 19 instances of fib with $n=2\dots20$ and the single instance of print.
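To make the task count concrete, the chain of fib tasks can be mimicked by a sequential loop in plain C++. This is only a reference sketch of the same recurrence, not TTG code; FibRun and fib_push are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

struct FibRun {
  std::int64_t value;  // F_N
  int fib_tasks;       // number of fib "tasks" that ran
};

// Sequential stand-in for the fib TT: each loop iteration plays the role of
// one fib task, starting from n = 2 with inputs (F_1, F_0) = (1, 0).
FibRun fib_push(std::int64_t N) {
  assert(N >= 2);  // the chain is kickstarted at n = 2
  std::int64_t n = 2, F_nm1 = 1, F_nm2 = 0;
  int tasks = 0;
  while (true) {
    ++tasks;
    const std::int64_t F_n = F_nm1 + F_nm2;
    if (n == N) return {F_n, tasks};  // corresponds to sending F_n to print
    // corresponds to sending F_n and F_{n-1} to the fib task for n + 1
    F_nm2 = F_nm1;
    F_nm1 = F_n;
    ++n;
  }
}
```

For N = 20 this returns the value 6765 after 19 iterations, matching the 19 fib tasks counted above.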

Note that unlike typical task-based implementations in the literature which construct tasks recursively, i.e., the task for computing $F_{n}$ is created before the task computing $F_{n-1}$, the TTG implementation constructs the tasks in the order of increasing $n$. This is because parametric dataflow of TTG naturally expresses inductive (push) computation patterns rather than recursive (pull) computation patterns. However, it is easy to implement proper recursion by separating the downward flow of control (task creation, $F_{n} \to F_{n-1},F_{n-2}$) from the upward flow of data (task evaluation, $F_{n-1},F_{n-2} \to F_{n}$).
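For contrast, the recursive (pull) formulation can be sketched in plain C++ while recording the order in which "tasks" are created. This is an illustrative sketch only; fib_pull and the created log are hypothetical names, not part of TTG.

```cpp
#include <cstdint>
#include <vector>

// Recursive (pull) Fibonacci: the "task" for n is created before the tasks
// for n - 1 and n - 2; `created` records the creation order.
std::int64_t fib_pull(int n, std::vector<int>& created) {
  created.push_back(n);
  if (n < 2) return n;  // F_0 = 0, F_1 = 1
  return fib_pull(n - 1, created) + fib_pull(n - 2, created);
}
```

Here the first entry of created is always the requested n, i.e., creation proceeds from large n to small n, whereas the TTG program above creates tasks in the order of increasing n.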

Data-Dependent Example : Largest Fibonacci Number < N

To illustrate the real power of TTG let's tweak the problem slightly: instead of computing the $N$th Fibonacci number let's find the largest Fibonacci number smaller than some $N$. The key difference in the latter case is that, unlike the former, the number of tasks is NOT known a priori; furthermore, to decide whether we need to compute the next Fibonacci number we must examine the value returned by the previous task. This is an example of data-dependent tasking, where the decision which (if any) task to execute next depends on the values produced by previous tasks. The ability to compose regular as well as data-dependent task graphs is a distinguishing strength of TTG.

To make things even more interesting, we will demonstrate how to implement such program both for execution on CPUs as well as on accelerators (GPUs). The complete examples, including the CMake build harness, can be found in dox examples.

The CPU Version

#include <ttg.h>
#include "ttg/serialization.h"
 
struct Fn {
  int64_t F[2];  // F[0] = F_n, F[1] = F_{n-1}
  Fn() { F[0] = 1; F[1] = 0; }
  // make Fn serializable
  template <typename Archive>
  void serialize(Archive& ar) {
    ar & F;
  }
  template <typename Archive>
  void serialize(Archive& ar, const unsigned int) {
    ar & F;
  }
};
 
auto make_ttg_fib_lt(const int64_t F_n_max) {
  ttg::Edge<int64_t, Fn> f2f; // fib to fib
  ttg::Edge<void, Fn> f2p;    // fib to print
 
  auto fib = ttg::make_tt(
      [=](int64_t n, Fn&& f_n) {
        int64_t next_f_n = f_n.F[0] + f_n.F[1];
        f_n.F[1] = f_n.F[0];
        f_n.F[0] = next_f_n;
        if (next_f_n < F_n_max) {
          // send to next Fib
          ttg::send<0>(n + 1, f_n);
        } else {
          // send to print
          ttg::sendv<1>(f_n);
        }
      },
      ttg::edges(f2f), ttg::edges(f2f, f2p), "fib");
 
  auto print = ttg::make_tt(
      [=](const Fn& f_n) {
        std::cout << "The largest Fibonacci number smaller than " << F_n_max << " is " << f_n.F[1] << std::endl;
      },
      ttg::edges(f2p), ttg::edges(), "print");
  // create a TTG that receives inputs on the first input of fib
  auto ins = std::make_tuple(fib->template in<0>());
  // collect fib and print into a vector
  std::vector<std::unique_ptr<ttg::TTBase>> ops;
  ops.emplace_back(std::move(fib));
  ops.emplace_back(std::move(print));
  // instantiate the TTG; note the use of make_ttg instead of make_tt
  return make_ttg(std::move(ops), ins, std::make_tuple(), "Fib_n < N");
}
 
int main(int argc, char* argv[]) {
  ttg::initialize(argc, argv, -1);
  int64_t N = 1000;
  if (argc > 1) N = std::atol(argv[1]);
 
  auto fib = make_ttg_fib_lt(N);
  ttg::make_graph_executable(fib.get());
  ttg::execute();
  if (ttg::get_default_world().rank() == 0)
    fib->template in<0>()->send(1, Fn{});  // kickstart via the TTG's first input terminal
 
  ttg::fence();
  ttg::finalize();
  return 0;
}
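Stripped of TTG, the data-dependent control flow of the fib task above amounts to the following loop. This plain C++ sketch is only a reference for checking results; FibLtResult and fib_lt_reference are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

struct FibLtResult {
  std::int64_t largest;  // largest Fibonacci number below the bound
  int fib_tasks;         // number of fib "tasks" needed to find it
};

// Each iteration plays the role of one fib task; whether another iteration
// follows depends on the value just computed (data-dependent tasking).
FibLtResult fib_lt_reference(std::int64_t F_n_max) {
  assert(F_n_max > 1);  // for smaller bounds the very first task already overshoots
  std::int64_t F[2] = {1, 0};  // same initial state as Fn's default constructor
  int tasks = 0;
  while (true) {
    ++tasks;
    const std::int64_t next = F[0] + F[1];
    F[1] = F[0];
    F[0] = next;
    if (next >= F_n_max) return {F[1], tasks};  // the branch that sends to print
  }
}
```

For a bound of 1000 the loop stops after 16 iterations with 987; the iteration count is not known until the values themselves have been computed.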

Utility of <tt>Fn</tt> struct

Fn aggregates 2 pieces of data that were separate before, in preparation for aggregating datums into single contiguous chunks that can be allocated on a GPU more efficiently. This arrangement allows each task to access and modify both the current and previous Fibonacci values without the need for separate data fields or additional communication overhead.

F[0] and F[1] store the current ($F_n$) and previous ($F_{n-1}$) Fibonacci numbers, respectively.
The default constructor starts the iteration by initializing F[0]=1 and F[1]=0.

Because Fn is now a user-defined type, for TTG to be able to copy/move it between tasks it needs to know how to serialize and deserialize it. The serialize member functions provide this: TTG uses them to serialize and deserialize the data as it is sent and received through the task graph.
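The idea behind a single serialize function template that serves both directions can be sketched with two toy archive classes. ToyOutputArchive and ToyInputArchive are hypothetical stand-ins written for this illustration; they are not the archive types that TTG or its backends use, and they only handle fixed-size arrays of trivially copyable elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Toy "saving" archive: operator& appends the raw bytes of an array.
struct ToyOutputArchive {
  std::vector<unsigned char> bytes;
  template <typename T, std::size_t N>
  ToyOutputArchive& operator&(const T (&a)[N]) {
    static_assert(std::is_trivially_copyable<T>::value, "toy archive: trivially copyable only");
    const auto* p = reinterpret_cast<const unsigned char*>(a);
    bytes.insert(bytes.end(), p, p + sizeof(a));
    return *this;
  }
};

// Toy "loading" archive: operator& copies bytes back into an array.
struct ToyInputArchive {
  explicit ToyInputArchive(const std::vector<unsigned char>& b) : bytes(b) {}
  const std::vector<unsigned char>& bytes;
  std::size_t pos = 0;
  template <typename T, std::size_t N>
  ToyInputArchive& operator&(T (&a)[N]) {
    static_assert(std::is_trivially_copyable<T>::value, "toy archive: trivially copyable only");
    assert(pos + sizeof(a) <= bytes.size());
    std::memcpy(a, bytes.data() + pos, sizeof(a));
    pos += sizeof(a);
    return *this;
  }
};

struct Fn {
  std::int64_t F[2];  // F[0] = F_n, F[1] = F_{n-1}
  Fn() { F[0] = 1; F[1] = 0; }
  // The same function saves or loads, depending on the archive type.
  template <typename Archive>
  void serialize(Archive& ar) {
    ar & F;
  }
};
```

Calling serialize with the output archive and then with the input archive on a second Fn object reproduces the original values, which is the round trip a runtime performs when a datum crosses process boundaries.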

Why <tt>make_ttg_fib_lt</tt>?

Until now we have constructed individual TTs and linked them together; i.e., the TTG until now was implicit. The function make_ttg_fib_lt instead explicitly creates a graph of TTs (a TTG). This seemingly small step improves composability: an entire TTG can be used as a component of a larger graph by connecting it to other TTs or TTGs.

Device Version

It is currently not possible to have a general-purpose task runtime execute purely on a device, hence TTG and the underlying runtimes execute tasks on the host (CPU), and these tasks launch device kernels. For technical reasons it is necessary to split the code into the host-only part, which looks remarkably like the CPU-only version above, and the device-specific part that implements the core part of the computation on the device. In the future it may become possible to have single-source programs that contain both host and device parts in the same source file.

Host-side Code

The host-only part is completely independent of the type of the device programming model.

struct Fn : public ttg::TTValue<Fn> {
  std::shared_ptr<int64_t[]> F;  // F[0] = F_n, F[1] = F_{n-1}
  ttg::Buffer<int64_t> b; // buffer managing host and device memory
  Fn() : F(std::make_shared<int64_t[]>(2)), b(F.get(), 2) { F[0] = 1; F[1] = 0; }
  Fn(const Fn&) = delete;
  Fn(Fn&& other) = default;
  Fn& operator=(const Fn& other) = delete;
  Fn& operator=(Fn&& other) = default;
  template <typename Archive>
  void serialize(Archive& ar) {
    ar & F[0] & F[1] & b;
  }
  template <typename Archive>
  void serialize(Archive& ar, const unsigned int) {
    ar & F[0] & F[1] & b;
  }
};
 
auto make_ttg_fib_lt(const int64_t F_n_max = 1000) {
  ttg::Edge<int64_t, Fn> f2f;
  ttg::Edge<void, Fn> f2p;
 
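  // ES is the target ttg::ExecutionSpace (e.g., ttg::ExecutionSpace::CUDA); it is assumed to be defined elsewhere
  auto fib = ttg::make_tt<ES>(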
      [=](int64_t n, Fn&& f_n) -> ttg::device::Task {
        assert(n > 0);
        ttg::trace("in fib: n=", n, " F_n=", f_n.F[0]);
 
        // select a device and make b available
        co_await ttg::device::select(f_n.b);
        // compute the next value on the device (see below)
        next_value(f_n.b.current_device_ptr());
 
        // wait for the task to complete and the values to be brought back to the host
        co_await ttg::device::wait(f_n.b);
 
        if (f_n.F[0] < F_n_max) {
          co_await ttg::device::forward(ttg::device::send<0>(n + 1, std::move(f_n)));
        } else {
          co_await ttg::device::forward(ttg::device::sendv<1>(std::move(f_n)));
        }
      },
      ttg::edges(f2f), ttg::edges(f2f, f2p), "fib");
 
  auto print = ttg::make_tt(
      [=](Fn&& f_n) {
        std::cout << "The largest Fibonacci number smaller than " << F_n_max << " is " << f_n.F[1] << std::endl;
      },
      ttg::edges(f2p), ttg::edges(), "print");
 
  auto ins = std::make_tuple(fib->template in<0>());
  std::vector<std::unique_ptr<::ttg::TTBase>> ops;
  ops.emplace_back(std::move(fib));
  ops.emplace_back(std::move(print));
  return make_ttg(std::move(ops), ins, std::make_tuple(), "Fib_n < N");
}
 
int main(int argc, char* argv[]) {
  ttg::initialize(argc, argv, -1);
  int64_t N = 1000;
  if (argc > 1) N = std::atol(argv[1]);
 
  auto fib = make_ttg_fib_lt(N);
  ttg::make_graph_executable(fib.get());
  ttg::execute();
  if (ttg::default_execution_context().rank() == 0)
    fib->template in<0>()->send(1, Fn{});
 
  ttg::fence();
  ttg::finalize();
  return 0;
}

Although the structure of the device-capable program is nearly identical to the CPU version, there are important differences:

Fn's data must exist on the host side (where the task is executed). To automate moving of the data between host and device memories Fn is implemented with the help of helper classes TTValue and Buffer.
task functions become coroutines (as indicated by their return type device::Task) to deal with the asynchrony of the host-device interactions (kernel launch, memory allocation and transfers)
the target execution space is specified as a template argument of type ExecutionSpace to make_tt

`TTValue`

For optimal performance, the low-level runtime manages the data motion across the memory hierarchy (host-to-host (i.e., between MPI ranks), host-to-device, and device-to-device), and so it must be able to track each datum as it orchestrates the computation. For example, when a TTG task sends a datum to an output terminal connected to multiple consumers the runtime may avoid unnecessary copies, e.g., by recognizing that all consumers will only need read-only access to the data, hence a reference to the same datum can be passed to all consumers. This requires mapping a pointer to a C++ object to the control block that describes that object to the runtime. Deriving C++ type T from TTValue<T> includes the control block in T and avoids creating a separate control block. This is particularly important for the data that has to travel to the device.

`Buffer`

Buffer<T> is a view of a contiguous sequence of objects of type T in the host memory that can be automatically moved by the runtime to/from the device memory. Here Fn::b is a view of the 2-element sequence pointed to by Fn::F; once it is constructed the content of Fn::F will be moved to/from the device by the runtime. The subsequent uses of Fn::b cause the automatic transfers of data to (ttg::device::select(f_n.b)) and from (ttg::device::wait(f_n.b)) the device. A Buffer<T> can be either freestanding or have its lifetime tied to the lifetime of the host buffer; the latter is used in the example above, indicated by the use of shared_ptr to manage the lifetime of the host buffer. If no pointer is passed to the constructor of Buffer<T> the buffer allocates the necessary host-side memory. In order to guarantee relocatability of buffers, the data managed by a buffer should be located on the heap, i.e., dynamically allocated.

`device::Task`

The key challenge of device programming models is that they are fundamentally asynchronous to hide the latency of interacting with the device. Unlike function calls on the CPU, kernel launches and memory transfers take thousands of CPU cycles, and asynchrony helps amortize these costs by overlapping kernel launch and execution. Task programming models are a seemingly good match for device programming, but the key challenge is how to make device-capable task code look as much as possible like standard host-only task code. TTG's ability to use C++ coroutines as task bodies allows it to deal with asynchronous calls inside the tasks (the use of coroutines is the primary reason why TTG requires C++20 support by the C++ compiler). Roughly speaking, coroutines are resumable functions; they can return to the caller via a co_await statement and be resumed at that point once some condition (typically, completion of submitted actions) has been satisfied. Device tasks co_await at every point where further progress requires completion of preceding device tasks:

The first co_await ttg::device::select asks the runtime to select a device (if multiple are available) and ensures that the contents of f_n.F[] are made available on that device. During the first invocation the data resides on the host, hence the runtime allocates memory on the device and transfers the contents of f_n.F[] from host to device. During subsequent invocations the contents of f_n.F[] are likely already available on the device (unless the runtime decides to compute $F_{n+1}$ on a different device than $F_n$), thus this co_await may become a no-op.
The second co_await ttg::device::wait ensures that the kernel launched by next_value has completed and the contents of f_n.F[] changed by that kernel are available on the host. This always causes device-to-host transfer if one or more Buffer<T> are provided. If no buffer is provided then the call only waits for all previously submitted kernels to complete.
The last set of co_await's ensures that the corresponding ttg::device::send, which sends the data located in the device memory, has completed. Since device::send within a task may return a local variable (e.g., for the key) exit from the coroutine would destroy such variables prematurely, hence instead of a co_return the coroutine concludes by waiting for the device::send to complete before exiting.

`ExecutionSpace`

TTG and its underlying runtime need to be told in which execution space the task code will operate. The current choices are denoted by the ExecutionSpace enumeration:

ExecutionSpace::Host: host processor (default)
ExecutionSpace::CUDA: an NVIDIA CUDA device
ExecutionSpace::HIP: an AMD HIP device
ExecutionSpace::L0: an Intel L0 device

Device Kernel

Here's the CUDA version of the device kernel and its host-side wrapper; ROCm and SYCL/Level0 variants will be very similar to the CUDA version:

#include "fibonacci_cuda_kernel.h"

__global__ void cu_next_value(int64_t* fn_and_fnm1) {
  int64_t fnp1 = fn_and_fnm1[0] + fn_and_fnm1[1];
  fn_and_fnm1[1] = fn_and_fnm1[0];
  fn_and_fnm1[0] = fnp1;
}

void next_value(int64_t* fn_and_fnm1) {
  cu_next_value<<<1, 1>>>(fn_and_fnm1);
}

cu_next_value is the device kernel that evaluates $F_{n+1}$ from $F_{n}$ and $F_{n-1}$. next_value is a host function that launches cu_next_value; this is the function called in the fib task.
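When no GPU is at hand, the arithmetic of the kernel can be checked with a host-only function of the same signature. This is a hypothetical helper (next_value_host is not part of the TTG examples) and is meant for testing the logic only.

```cpp
#include <cstdint>

// Host-only stand-in for next_value: updates (F_n, F_{n-1}) in place
// to (F_{n+1}, F_n), exactly as cu_next_value does on the device.
void next_value_host(std::int64_t* fn_and_fnm1) {
  const std::int64_t fnp1 = fn_and_fnm1[0] + fn_and_fnm1[1];
  fn_and_fnm1[1] = fn_and_fnm1[0];
  fn_and_fnm1[0] = fnp1;
}
```

Starting from {1, 0}, three calls yield {3, 2}.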

Debugging TTG Programs

TTG Visualization

TTGs can be exported in the DOT format as follows:

std::cout << ttg::Dot()(tt.get()) << std::endl;

Use GraphViz to visualize the resulting graph.
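For readers unfamiliar with the DOT format, the following plain C++ sketch writes a small directed graph in DOT syntax. The helper toy_dot is hypothetical, and its output is only an illustration of the format; the text that ttg::Dot actually produces for a TTG will differ.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emit a directed graph in GraphViz DOT syntax from a list of (from, to) pairs.
std::string toy_dot(const std::vector<std::pair<std::string, std::string>>& edges) {
  std::ostringstream os;
  os << "digraph G {\n";
  for (const auto& [from, to] : edges) {
    os << "  \"" << from << "\" -> \"" << to << "\";\n";
  }
  os << "}\n";
  return os.str();
}
```

Calling toy_dot({{"fib", "fib"}, {"fib", "print"}}) describes the shape of the Fibonacci flowgraph; saving the text to graph.dot and running dot -Tpdf graph.dot -o graph.pdf renders it.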

Task Graph Visualization

Exporting the DAG of tasks resulting from execution of a TTG will be possible as soon as PR 227 has been merged.

Launching a Debugger

To simplify debugging of multirank TTG programs it is possible to automate the process as follows:

If an X11 server is running (check if environment variable DISPLAY is set), then set environment variable TTG_DEBUGGER to {gdb_xterm,lldb_xterm} to launch {gdb,lldb} upon receiving a signal like SIGSEGV or SIGABRT (one xterm window per rank will be created);
If an X11 server is not running, then set TTG_DEBUGGER to an empty value; upon receiving a signal the program will print instructions for how to attach a debugger to a running process from another terminal.
Run the TTG program; if it receives any such signal, the xterm windows should pop up with the debugger attached.

TTG Performance

Competitive performance of TTG for several paradigmatic scientific applications on shared- and distributed-memory machines (CPU only) was discussed in manuscripts "Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment" and "Composition of Algorithmic Building Blocks in Template Task Graphs". Low-level benchmarking of TTG tasking was reported in manuscript "Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG".

TTG Performance Tracing

There are several ways to trace execution of a TTG program. The easiest way is to use the PaRSEC-based TTG backend to produce binary traces in PaRSEC Binary Trace (PBT) format and then convert them to a Chrome Trace Format (CTF) JSON file that can be visualized using the built-in trace viewer in the Chrome browser or using the web-based Perfetto trace viewer. To generate the trace of any TTG program follow the process described below:

For simplicity, we assume here that TTG will build PaRSEC from source. Make sure PaRSEC Python tools prerequisites have been installed, namely Python3 (version 3.8 is recommended) and the following Python packages (e.g., using pip):
- cython
- numpy
- pandas
- tables
Configure and build TTG:
- Configure TTG with -DPARSEC_PROF_TRACE=ON (this turns on PaRSEC task tracing) and -DBUILD_SHARED_LIBS=ON (to support PaRSEC Python tools). Also make sure that CMake discovers the Python3 interpreter and the cython package.
- Build and install TTG
Build the TTG program to be traced.
Run the TTG program with tracing turned on:
- Set the environment variables PARSEC_MCA_mca_pins and PARSEC_MCA_profile_filename to task_profiler and the PBT file name prefix (e.g. /tmp/ttg), respectively.
- Run the program and make sure the trace files (in PBT format) have been generated; e.g., if you set PARSEC_MCA_profile_filename to /tmp/ttg you should find file /tmp/ttg-0.prof-... containing the trace from MPI rank 0, /tmp/ttg-1.prof-... from rank 1, and so on.
Convert the traces from PaRSEC Binary Trace (PBT) format to the Chrome Trace Format (CTF):
- Add {TTG build directory}/_deps/parsec-build/tools/profiling/python/python.test (currently it is not possible to use PaRSEC Python module from the install tree, only from its build tree) to the PYTHONPATH environment variable so that the Python interpreter can find the modules for reading the PaRSEC trace files.
- Convert the PBT files to a CTF file by running the conversion script:
  {TTG install prefix}/bin/pbt_to_ctf.py {PBT file name prefix} {CTF filename}
Open the chrome://tracing URL in the Chrome browser and load the resulting trace; alternatively you can use the Perfetto trace viewer from any browser.

For example, executing the Fibonacci program described above using 2 MPI processes and with 2 threads each will produce a trace that looks like this:

[Figure: Fibonacci_traces_example]

TTG reference documentation

TTG API documentation is available online.

Cite

When referring to TTG in an academic setting please cite the following publication:

G. Bosilca, R. J. Harrison, T. Herault, M. M. Javanmard, P. Nookala and E. F. Valeev, "The Template Task Graph (TTG) - an emerging practical dataflow programming paradigm for scientific simulation at extreme scale," 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), 2020, pp. 1-7, doi: 10.1109/ESPM251964.2020.00011.

Acknowledgment

The development of TTG was made possible by:

The EPEXA project, currently supported by the National Science Foundation under grants 1931387 to Stony Brook University, 1931384 to the University of Tennessee, Knoxville, and 1931347 to Virginia Tech.
The TESSE project, supported by the National Science Foundation under grants 1450344 to Stony Brook University, 1450300 to the University of Tennessee, Knoxville, and 1450262 to Virginia Tech.