See also:

  • performance
  • algorithms
  • concurrency

Umut Acar, Parallel Computing: Theory and Practice (intro, in C++). Oregon Programming Languages Summer School: parallel algorithms. Sam Westrick: MPL ("MaPLe"), a parallel ML compiler.

Disentangling GPU programming: when, why, how.


Why: the roofline model, latency vs. bandwidth, arithmetic intensity.
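A minimal sketch of the roofline idea. The peak numbers are illustrative assumptions, not any real card's specs:

```python
# Roofline model: attainable FLOP/s = min(compute ceiling, bandwidth ceiling * intensity).
# Peak values below are made-up for illustration.
PEAK_FLOPS = 10e12   # assumed 10 TFLOP/s compute ceiling
PEAK_BW = 500e9      # assumed 500 GB/s memory bandwidth ceiling

def attainable_flops(arithmetic_intensity):
    """arithmetic_intensity: FLOPs performed per byte moved to/from memory."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# SAXPY (y = a*x + y): 2 FLOPs per element, 12 bytes moved (read x, read y, write y)
saxpy_intensity = 2 / 12
print(attainable_flops(saxpy_intensity))  # bandwidth-bound, far below PEAK_FLOPS

# ridge point: the intensity where the two ceilings meet
print(PEAK_FLOPS / PEAK_BW)               # 20.0 FLOPs/byte
```

Kernels left of the ridge point are bandwidth-bound; only kernels with higher intensity can reach the compute ceiling.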

SM = streaming multiprocessor. A warp is 32 threads.


tensor cores (docs). nvcc compiler: -gencode, -arch. godbolt. Toolchain: nvvm, nvtx, nvdisasm, nvprune, cuda-gdb, cudafe++ (separates host code from GPU code), cuobjdump, nvprof, nvlink, ptxas, bin2c.

cd /usr/local/cuda-12.3/bin/
#./nvcc --help

NVVM IR: an LLVM IR variant. CUPTI: CUDA Profiling Tools Interface. libnvvp (Visual Profiler): performance analysis tool. C++ extensions: the <<< >>> execution configuration syntax; nvcc lowers it to CUDA runtime calls.
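The <<<grid, block>>> configuration just fixes how many blocks of how many threads run; each thread recovers its global index from blockIdx/blockDim/threadIdx. A pure-Python sketch of that index math (the launcher and kernel names are my own; no GPU involved):

```python
# Emulate a 1-D CUDA launch kernel<<<grid_dim, block_dim>>>(...) on the CPU.
# Every (blockIdx, threadIdx) pair computes the usual global index
#   i = blockIdx.x * blockDim.x + threadIdx.x
# and out-of-range threads in the ragged last block do nothing.

def launch_1d(kernel, grid_dim, block_dim, *args):
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            kernel(i, *args)

def saxpy(i, a, x, y, out):
    if i < len(out):            # bounds check, as in a real kernel
        out[i] = a * x[i] + y[i]

n = 10
x = list(range(n)); y = [1.0] * n; out = [0.0] * n
block = 4
grid = (n + block - 1) // block  # ceil-divide: enough blocks to cover n
launch_1d(saxpy, grid, block, 2.0, x, y, out)
print(out)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]
```

The ceil-divide for the grid size and the in-kernel bounds check are the standard pair: together they cover n elements with a block size that need not divide n.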

PTX: a virtual instruction set, JIT-compiled by the driver; the actual ISA changes between cards (sm_86, sm_80, sm_75, sm_70, sm_60). There is a Ghidra spec for one GPU, not very complete-looking. Paper: A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code.

Runtime library (cudart): cudaInitDevice(), cudaSetDevice(), cudaMalloc(), cudaFree().

Thrust and CUB: CUDA core libraries. Numba has a CUDA simulator. Triton, JAX, torch.compile.

cuDNN, PhysX, TensorRT, cuBLAS, cuRAND, cuFFT, cuSOLVER, cuSPARSE, NPP (NVIDIA Performance Primitives), NVML (management library), cudart (runtime), NVRTC (runtime compilation), CUTLASS, cuCollections (cuco).

Book: Programming Massively Parallel Processors.

PyCUDA. "A code generator for array-based code on CPUs and GPUs."

import pycuda. GPU dataframes.

OpenCL and PyOpenCL for my integrated graphics. Graphics memory management library.

Host-side flow: create context, create queue, create buffer, create program, create kernel, set arguments, enqueue kernel, enqueue copy, enqueue map, release.

import numpy as np
import pyopencl as cl

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
__kernel void sum(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
knl = prg.sum  # Use this Kernel object for repeated calls
knl(queue, a_np.shape, None, a_g, b_g, res_g)

res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

# Check on CPU with Numpy:
print(res_np - (a_np + b_np))
print(np.linalg.norm(res_np - (a_np + b_np)))
assert np.allclose(res_np, a_np + b_np)

sudo apt install opencl-headers ocl-icd-opencl-dev -y
echo "
// C standard includes
#include <stdio.h>

// OpenCL includes
#include <CL/cl.h>

int main()
{
    cl_int CL_err = CL_SUCCESS;
    cl_uint numPlatforms = 0;

    CL_err = clGetPlatformIDs( 0, NULL, &numPlatforms );

    if (CL_err == CL_SUCCESS)
        printf(\"%u platform(s) found\n\", numPlatforms);
    else
        printf(\"clGetPlatformIDs(%i)\n\", CL_err);

    return 0;
}
" > /tmp/test.c
gcc -Wall -Wextra -D CL_TARGET_OPENCL_VERSION=300  /tmp/test.c -o /tmp/test -lOpenCL 
# -std=c++11 -lOpenCL /tmp/test.cpp -o /tmp/test
/tmp/test

Tensor tiling library. Fluids: lattice Boltzmann, Smoothed Particle Hydrodynamics. An OpenCL wrapper billed as "dead simple".




Vulkan. An OpenCL-to-Vulkan compiler (clspv).




Parallel primitives: scan, sort, reduce. Parallel hashmap.
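A sequential Python sketch of the Blelloch work-efficient exclusive scan (upsweep/downsweep). On a GPU each inner loop is one parallel step, since its iterations are independent:

```python
def exclusive_scan(a):
    """Blelloch exclusive prefix sum. Length must be a power of two here;
    real implementations pad. Each inner loop's iterations are independent
    across k, which is what makes each pass parallel on a GPU."""
    x = list(a)
    n = len(x)
    # upsweep (reduce): build partial sums up a binary tree
    d = 1
    while d < n:
        for k in range(0, n, 2 * d):
            x[k + 2 * d - 1] += x[k + d - 1]
        d *= 2
    # downsweep: clear the root, push prefixes back down the tree
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for k in range(0, n, 2 * d):
            t = x[k + d - 1]
            x[k + d - 1] = x[k + 2 * d - 1]
            x[k + 2 * d - 1] += t
        d //= 2
    return x

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

Total work is O(n) across both sweeps, versus O(n log n) for the naive doubling scan, which is why this is the shape Thrust/CUB use.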

Union-find on the GPU?
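For reference, the sequential structure in question: union-find with path compression. GPU connected-components variants typically replace the serial find loop with parallel pointer jumping; this is only the CPU baseline sketch:

```python
class UnionFind:
    """Sequential union-find (disjoint sets). The pointer-chasing in find()
    is the part GPU variants parallelize via pointer jumping."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

uf = UnionFind(6)
for a, b in [(0, 1), (1, 2), (4, 5)]:
    uf.union(a, b)
print(uf.find(0) == uf.find(2))  # True: 0,1,2 are connected
print(uf.find(0) == uf.find(4))  # False: 4,5 form a separate set
```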

Joins: "Towards Iterative Relational Algebra on the GPU", Micinski et al., USENIX '23.


TACO, TVM, Halide.

Accelerate, Repa, Futhark, Co-dfns.

Gröbner bases on the GPU? Datalog on the GPU. HVM. SAT inprocessing, 3-SAT on CUDA, resolution? Term rewriting (there was that K/J webpage).

Applications: fluids, molecular dynamics, bioinformatics, structure from motion, CT scan reconstruction(?), rendering.

A grid of SHOs (simple harmonic oscillators)? Cellular automata.
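Cellular automata are about the easiest thing to map onto a GPU: one thread per cell, all cells update independently each step. A NumPy sketch of one Game of Life step on a toroidal grid (CPU-side, but the per-cell independence is the point):

```python
import numpy as np

def life_step(grid):
    """One Game of Life step on a toroidal grid. Every cell's update depends
    only on the previous grid, so this maps to one GPU thread per cell."""
    # count the 8 neighbours by summing shifted copies of the grid
    nbrs = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))
    # birth on 3 neighbours; survival on 2 or 3
    return ((nbrs == 3) | ((grid == 1) & (nbrs == 2))).astype(grid.dtype)

# blinker: a period-2 oscillator, back to itself after two steps
g = np.zeros((5, 5), dtype=np.uint8)
g[2, 1:4] = 1                       # horizontal bar of three live cells
print(np.array_equal(g, life_step(life_step(g))))  # True
```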