GPU API Reference

GPU compute support for CUDA, Metal, and OpenCL, with host-side runtime management

Import

U std/gpu

Overview

The GPU module provides two categories of functions:

  1. Kernel-side intrinsics: thread indexing, synchronization, atomics, and math. These calls are replaced by the GPU code generator when compiling with --gpu.
  2. Host-side runtime API: memory allocation, kernel launching, and device management, linked from gpu_runtime.c.

Compile with: vaisc build file.vais --gpu cuda --gpu-compile

Thread Indexing Functions

Basic Thread/Block Indices

| Function | Signature | Description |
|----------|-----------|-------------|
| thread_idx_x | F thread_idx_x() -> i64 | Thread index within block (x) |
| thread_idx_y | F thread_idx_y() -> i64 | Thread index within block (y) |
| thread_idx_z | F thread_idx_z() -> i64 | Thread index within block (z) |
| block_idx_x | F block_idx_x() -> i64 | Block index within grid (x) |
| block_idx_y | F block_idx_y() -> i64 | Block index within grid (y) |
| block_idx_z | F block_idx_z() -> i64 | Block index within grid (z) |

Dimensions

| Function | Signature | Description |
|----------|-----------|-------------|
| block_dim_x | F block_dim_x() -> i64 | Threads per block (x) |
| block_dim_y | F block_dim_y() -> i64 | Threads per block (y) |
| block_dim_z | F block_dim_z() -> i64 | Threads per block (z) |
| grid_dim_x | F grid_dim_x() -> i64 | Blocks per grid (x) |
| grid_dim_y | F grid_dim_y() -> i64 | Blocks per grid (y) |
| grid_dim_z | F grid_dim_z() -> i64 | Blocks per grid (z) |

Global Indexing

| Function | Signature | Description |
|----------|-----------|-------------|
| global_idx | F global_idx() -> i64 | Global thread index (1D) |
| global_idx_x | F global_idx_x() -> i64 | Global thread index (2D x) |
| global_idx_y | F global_idx_y() -> i64 | Global thread index (2D y) |
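
A one-dimensional kernel typically reads global_idx() once and bounds-checks it against the element count; the same index can also be computed manually from the block and thread intrinsics. A minimal sketch (the scale kernel and its parameters are illustrative, not part of the API):

U std/gpu

#[gpu]
F scale(data: *f64, n: i64, factor: f64) -> i64 {
    # Equivalent to block_idx_x() * block_dim_x() + thread_idx_x()
    idx := global_idx()
    I idx < n {
        data[idx] = data[idx] * factor
    }
    0
}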

Synchronization Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| sync_threads | F sync_threads() -> i64 | Block-level barrier (all threads) |
| thread_fence | F thread_fence() -> i64 | Global memory fence |
| thread_fence_block | F thread_fence_block() -> i64 | Shared memory fence |

Atomic Operations

| Function | Signature | Description |
|----------|-----------|-------------|
| atomic_add | F atomic_add(addr: *i64, val: i64) -> i64 | Atomic add, returns old value |
| atomic_add_f64 | F atomic_add_f64(addr: *f64, val: f64) -> f64 | Atomic add for f64 |
| atomic_sub | F atomic_sub(addr: *i64, val: i64) -> i64 | Atomic subtract |
| atomic_min | F atomic_min(addr: *i64, val: i64) -> i64 | Atomic minimum |
| atomic_max | F atomic_max(addr: *i64, val: i64) -> i64 | Atomic maximum |
| atomic_and | F atomic_and(addr: *i64, val: i64) -> i64 | Atomic bitwise AND |
| atomic_or | F atomic_or(addr: *i64, val: i64) -> i64 | Atomic bitwise OR |
| atomic_xor | F atomic_xor(addr: *i64, val: i64) -> i64 | Atomic bitwise XOR |
| atomic_cas | F atomic_cas(addr: *i64, compare: i64, val: i64) -> i64 | Compare-and-swap |
| atomic_exch | F atomic_exch(addr: *i64, val: i64) -> i64 | Atomic exchange |
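
Because atomic_add returns the old value, it can hand each thread a unique output slot. A sketch of stream compaction built on that property (the kernel and buffer names are illustrative):

U std/gpu

#[gpu]
F compact_positive(input: *f64, output: *f64, counter: *i64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n && input[idx] > 0.0 {
        # The previous counter value is this thread's private output slot
        slot := atomic_add(counter, 1)
        output[slot] = input[idx]
    }
    0
}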

GPU Math Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_sqrt | F gpu_sqrt(x: f64) -> f64 | Fast square root |
| gpu_rsqrt | F gpu_rsqrt(x: f64) -> f64 | Fast reciprocal square root |
| gpu_sin | F gpu_sin(x: f64) -> f64 | Fast sine |
| gpu_cos | F gpu_cos(x: f64) -> f64 | Fast cosine |
| gpu_exp | F gpu_exp(x: f64) -> f64 | Fast exponential |
| gpu_log | F gpu_log(x: f64) -> f64 | Fast logarithm |
| gpu_fma | F gpu_fma(a: f64, b: f64, c: f64) -> f64 | Fused multiply-add: a*b+c |
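
A small illustrative kernel that normalizes packed 3-component vectors using gpu_rsqrt and gpu_fma (the normalize3 kernel is hypothetical, not part of the API):

U std/gpu

#[gpu]
F normalize3(v: *f64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        x := v[idx * 3]
        y := v[idx * 3 + 1]
        z := v[idx * 3 + 2]
        # x*x + y*y + z*z accumulated with fused multiply-adds
        len_sq := gpu_fma(x, x, gpu_fma(y, y, z * z))
        inv := gpu_rsqrt(len_sq)
        v[idx * 3] = x * inv
        v[idx * 3 + 1] = y * inv
        v[idx * 3 + 2] = z * inv
    }
    0
}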

Shared Memory

| Function | Signature | Description |
|----------|-----------|-------------|
| shared_alloc | F shared_alloc(size: i64) -> i64 | Allocate shared memory (per-block) |

Memory Operations

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_load | F gpu_load(addr: *f64) -> f64 | Coalesced load from global memory |
| gpu_store | F gpu_store(addr: *f64, val: f64) -> i64 | Coalesced store to global memory |

Utility Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_clamp | F gpu_clamp(x: f64, lo: f64, hi: f64) -> f64 | Clamp value to range |
| gpu_lerp | F gpu_lerp(a: f64, b: f64, t: f64) -> f64 | Linear interpolation |
| gpu_step | F gpu_step(edge: f64, x: f64) -> f64 | Step function |
| gpu_smoothstep | F gpu_smoothstep(edge0: f64, edge1: f64, x: f64) -> f64 | Smooth Hermite interpolation |
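
A short illustrative kernel that combines these helpers to fade values toward 1.0 across an edge range (the fade_edges kernel is hypothetical):

U std/gpu

#[gpu]
F fade_edges(data: *f64, n: i64, edge0: f64, edge1: f64) -> i64 {
    idx := global_idx()
    I idx < n {
        # Smooth blend factor in [0, 1] based on where the value falls between the edges
        t := gpu_smoothstep(edge0, edge1, data[idx])
        data[idx] = gpu_clamp(gpu_lerp(data[idx], 1.0, t), 0.0, 1.0)
    }
    0
}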

Warp/Wavefront Operations

| Function | Signature | Description |
|----------|-----------|-------------|
| lane_id | F lane_id() -> i64 | Lane index within warp (0-31 or 0-63) |
| warp_all | F warp_all(condition: i64) -> i64 | True if all lanes have condition true |
| warp_any | F warp_any(condition: i64) -> i64 | True if any lane has condition true |
| warp_ballot | F warp_ballot(condition: i64) -> i64 | Bitmask of lanes with condition true |
| warp_shuffle | F warp_shuffle(val: i64, src_lane: i64) -> i64 | Get value from another lane |
| warp_shuffle_down | F warp_shuffle_down(val: i64, delta: i64) -> i64 | Get value from lane + delta |
| warp_shuffle_up | F warp_shuffle_up(val: i64, delta: i64) -> i64 | Get value from lane - delta |
| warp_shuffle_xor | F warp_shuffle_xor(val: i64, mask: i64) -> i64 | Get value from lane ^ mask |

Block Reduction Operations

| Function | Signature | Description |
|----------|-----------|-------------|
| block_reduce_sum | F block_reduce_sum(val: f64) -> f64 | Block-level sum reduction |
| block_reduce_max | F block_reduce_max(val: f64) -> f64 | Block-level max reduction |
| block_reduce_min | F block_reduce_min(val: f64) -> f64 | Block-level min reduction |
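
These wrap the shared-memory reduction pattern shown later in Usage. A sketch of a dot-product kernel that emits one partial sum per block, assuming block_reduce_sum returns the block total (names are illustrative):

U std/gpu

#[gpu]
F dot_partial(a: *f64, b: *f64, partial: *f64, n: i64) -> i64 {
    idx := global_idx()
    val := I idx < n { a[idx] * b[idx] } ! { 0.0 }
    total := block_reduce_sum(val)
    # One thread per block writes the partial result
    I thread_idx_x() == 0 {
        partial[block_idx_x()] = total
    }
    0
}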

Grid Configuration Helpers

| Function | Signature | Description |
|----------|-----------|-------------|
| calc_blocks | F calc_blocks(n: i64, block_size: i64) -> i64 | Calculate blocks needed for n elements |
| calc_threads | F calc_threads(n: i64, block_size: i64) -> i64 | Calculate total threads for n elements |
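
calc_blocks presumably rounds up to cover all elements, so with n = 1,000,000 and block_size = 256 it should return 3907 blocks, and calc_threads 3907 * 256 = 1,000,192 threads:

blocks := calc_blocks(1000000, 256)    # expected 3907 = ceil(1000000 / 256)
threads := calc_threads(1000000, 256)  # expected 1000192 = 3907 * 256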

Struct

KernelConfig

Configure kernel launch parameters.

Fields:

  • grid_x: i64 - Grid dimension x
  • grid_y: i64 - Grid dimension y
  • grid_z: i64 - Grid dimension z
  • block_x: i64 - Block dimension x
  • block_y: i64 - Block dimension y
  • block_z: i64 - Block dimension z
  • shared_memory: i64 - Shared memory bytes

| Function | Signature | Description |
|----------|-----------|-------------|
| kernel_config_default | F kernel_config_default() -> KernelConfig | Default config (1 block, 256 threads) |
| kernel_config_1d | F kernel_config_1d(n: i64, block_size: i64) -> KernelConfig | 1D kernel config |
| kernel_config_2d | F kernel_config_2d(width: i64, height: i64, block_x: i64, block_y: i64) -> KernelConfig | 2D kernel config |
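
A config struct can feed gpu_launch_kernel directly. A host-side sketch, where my_kernel, args, and arg_count are placeholders standing in for a real kernel and its argument array:

F main() -> i64 {
    n := 1000000
    cfg := kernel_config_1d(n, 256)
    gpu_launch_kernel(my_kernel, cfg.grid_x, cfg.grid_y, cfg.grid_z,
                      cfg.block_x, cfg.block_y, cfg.block_z,
                      cfg.shared_memory, args, arg_count)
    0
}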

Host-Side API

Memory Management

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_alloc | F gpu_alloc(size: i64) -> *i64 | Allocate GPU device memory |
| gpu_free | F gpu_free(ptr: *i64) -> i64 | Free GPU device memory |
| gpu_memcpy_h2d | F gpu_memcpy_h2d(dst: *i64, src: *i64, size: i64) -> i64 | Copy host to device |
| gpu_memcpy_d2h | F gpu_memcpy_d2h(dst: *i64, src: *i64, size: i64) -> i64 | Copy device to host |
| gpu_memcpy_d2d | F gpu_memcpy_d2d(dst: *i64, src: *i64, size: i64) -> i64 | Copy device to device |
| gpu_memset | F gpu_memset(ptr: *i64, value: i64, size: i64) -> i64 | Set device memory to value |
| gpu_alloc_managed | F gpu_alloc_managed(size: i64) -> *i64 | Allocate unified/managed memory |

Kernel Execution

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_launch_kernel | F gpu_launch_kernel(kernel_func: *i64, grid_x: i64, grid_y: i64, grid_z: i64, block_x: i64, block_y: i64, block_z: i64, shared_mem: i64, args: *i64, arg_count: i64) -> i64 | Launch CUDA kernel |
| gpu_synchronize | F gpu_synchronize() -> i64 | Wait for all GPU operations |

Stream Management

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_stream_create | F gpu_stream_create() -> *i64 | Create CUDA stream |
| gpu_stream_destroy | F gpu_stream_destroy(stream: *i64) -> i64 | Destroy stream |
| gpu_stream_synchronize | F gpu_stream_synchronize(stream: *i64) -> i64 | Synchronize stream |

Device Management

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_device_count | F gpu_device_count() -> i64 | Get number of CUDA devices |
| gpu_set_device | F gpu_set_device(device_id: i64) -> i64 | Set active device |
| gpu_get_device | F gpu_get_device() -> i64 | Get current device ID |
| gpu_device_name | F gpu_device_name(device_id: i64) -> *i8 | Get device name |
| gpu_device_total_mem | F gpu_device_total_mem(device_id: i64) -> i64 | Get total device memory |
| gpu_device_max_threads | F gpu_device_max_threads(device_id: i64) -> i64 | Get max threads per block |
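
A minimal host-side sketch that enumerates devices and selects the first one; how the queried values are reported is left out, since printing is outside this module:

U std/gpu

F main() -> i64 {
    count := gpu_device_count()
    i := 0
    L i < count {
        name := gpu_device_name(i)
        total := gpu_device_total_mem(i)
        max_threads := gpu_device_max_threads(i)
        # ... report name, total, max_threads ...
        i = i + 1
    }
    I count > 0 {
        gpu_set_device(0)
    }
    0
}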

Event Timing

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_event_create | F gpu_event_create() -> *i64 | Create CUDA event |
| gpu_event_destroy | F gpu_event_destroy(event: *i64) -> i64 | Destroy event |
| gpu_event_record | F gpu_event_record(event: *i64) -> i64 | Record event |
| gpu_event_synchronize | F gpu_event_synchronize(event: *i64) -> i64 | Wait for event |
| gpu_event_elapsed | F gpu_event_elapsed(start: *i64, end: *i64) -> f64 | Get elapsed time (ms) between events |
| gpu_event_record_stream | F gpu_event_record_stream(event: *i64, stream: *i64) -> i64 | Record event on stream |

Async Memory Transfer

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_memcpy_h2d_async | F gpu_memcpy_h2d_async(dst: *i64, src: *i64, size: i64, stream: *i64) -> i64 | Async host-to-device copy |
| gpu_memcpy_d2h_async | F gpu_memcpy_d2h_async(dst: *i64, src: *i64, size: i64, stream: *i64) -> i64 | Async device-to-host copy |

Unified Memory Hints

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_mem_prefetch | F gpu_mem_prefetch(ptr: *i64, size: i64, device_id: i64) -> i64 | Prefetch unified memory to device |
| gpu_mem_advise | F gpu_mem_advise(ptr: *i64, size: i64, advice: i64, device_id: i64) -> i64 | Advise memory access pattern |
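
A sketch of the managed-memory flow: allocate with gpu_alloc_managed, touch the data on the host, then prefetch it to device 0 before launching a kernel. The initialization and kernel launch are elided here:

U std/gpu

F main() -> i64 {
    n := 1000000
    size := n * 8
    data := gpu_alloc_managed(size)

    # ... initialize data on the host ...

    # Hint the driver to migrate the pages to device 0 before the kernel runs
    gpu_mem_prefetch(data, size, 0)

    # ... launch kernel on data ...

    gpu_synchronize()
    gpu_free(data)
    0
}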

Multi-GPU Peer Access

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_peer_access_enable | F gpu_peer_access_enable(peer_device: i64) -> i64 | Enable peer-to-peer access |
| gpu_peer_access_disable | F gpu_peer_access_disable(peer_device: i64) -> i64 | Disable peer-to-peer access |
| gpu_peer_can_access | F gpu_peer_can_access(device: i64, peer: i64) -> i64 | Check if peer access possible |
| gpu_memcpy_peer | F gpu_memcpy_peer(dst: *i64, dst_device: i64, src: *i64, src_device: i64, size: i64) -> i64 | Copy between devices |
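
A hedged sketch of copying a buffer from device 0 to device 1 when peer access is available; buffer contents and any kernel work are omitted:

U std/gpu

F main() -> i64 {
    size := 1024 * 8

    gpu_set_device(0)
    d_src := gpu_alloc(size)
    gpu_set_device(1)
    d_dst := gpu_alloc(size)

    I gpu_peer_can_access(0, 1) != 0 {
        # Enable access to device 1 from device 0, copy directly, then disable
        gpu_set_device(0)
        gpu_peer_access_enable(1)
        gpu_memcpy_peer(d_dst, 1, d_src, 0, size)
        gpu_peer_access_disable(1)
    }

    gpu_set_device(0)
    gpu_free(d_src)
    gpu_set_device(1)
    gpu_free(d_dst)
    0
}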

Error Handling

| Function | Signature | Description |
|----------|-----------|-------------|
| gpu_last_error | F gpu_last_error() -> i64 | Get last CUDA error code (0=success) |
| gpu_last_error_string | F gpu_last_error_string() -> *i8 | Get last error as string |
| gpu_reset_error | F gpu_reset_error() -> i64 | Reset/clear last error |
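
A small helper sketch that checks and clears the last error after a launch; check_gpu is illustrative, and how msg is reported is left out:

U std/gpu

F check_gpu() -> i64 {
    err := gpu_last_error()
    I err != 0 {
        msg := gpu_last_error_string()
        # ... report msg ...
        gpu_reset_error()
    }
    err
}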

Metal-Specific Functions (Apple GPU)

| Function | Signature | Description |
|----------|-----------|-------------|
| threadgroup_barrier | F threadgroup_barrier() -> i64 | Threadgroup memory barrier |
| device_barrier | F device_barrier() -> i64 | Device memory barrier |
| simd_sum | F simd_sum(val: f64) -> f64 | SIMD group sum |
| simd_min | F simd_min(val: f64) -> f64 | SIMD group minimum |
| simd_max | F simd_max(val: f64) -> f64 | SIMD group maximum |
| simd_broadcast | F simd_broadcast(val: f64, lane: i64) -> f64 | Broadcast from lane |
| quad_sum | F quad_sum(val: f64) -> f64 | Quad (4-wide) sum |
| quad_broadcast | F quad_broadcast(val: f64, lane: i64) -> f64 | Quad broadcast |
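
On the Metal backend, SIMD-group intrinsics can stand in for the warp shuffles above. A hedged sketch of per-SIMD-group partial sums, assuming a 32-wide SIMD group and that lane_id() maps to the SIMD-group lane:

U std/gpu

#[gpu]
F simd_partial_sums(input: *f64, partial: *f64, n: i64) -> i64 {
    idx := global_idx()
    val := I idx < n { input[idx] } ! { 0.0 }
    # Sum across the SIMD group without shared memory
    total := simd_sum(val)
    I lane_id() == 0 {
        partial[idx / 32] = total
    }
    0
}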

AVX-512 SIMD Operations (Intel/AMD)

Load/Store (512-bit vectors)

| Function | Signature | Description |
|----------|-----------|-------------|
| avx512_load_f32 | F avx512_load_f32(addr: *i64) -> i64 | Load 16 x f32 |
| avx512_store_f32 | F avx512_store_f32(addr: *i64, vec: i64) -> i64 | Store 16 x f32 |
| avx512_load_f64 | F avx512_load_f64(addr: *f64) -> i64 | Load 8 x f64 |
| avx512_store_f64 | F avx512_store_f64(addr: *f64, vec: i64) -> i64 | Store 8 x f64 |

Arithmetic

| Function | Signature | Description |
|----------|-----------|-------------|
| avx512_add_f32 | F avx512_add_f32(a: i64, b: i64) -> i64 | Vector add |
| avx512_sub_f32 | F avx512_sub_f32(a: i64, b: i64) -> i64 | Vector subtract |
| avx512_mul_f32 | F avx512_mul_f32(a: i64, b: i64) -> i64 | Vector multiply |
| avx512_div_f32 | F avx512_div_f32(a: i64, b: i64) -> i64 | Vector divide |
| avx512_fma_f32 | F avx512_fma_f32(a: i64, b: i64, c: i64) -> i64 | Vector FMA |

Reduction

| Function | Signature | Description |
|----------|-----------|-------------|
| avx512_reduce_add_f32 | F avx512_reduce_add_f32(vec: i64) -> f64 | Horizontal sum |
| avx512_reduce_min_f32 | F avx512_reduce_min_f32(vec: i64) -> f64 | Horizontal minimum |
| avx512_reduce_max_f32 | F avx512_reduce_max_f32(vec: i64) -> f64 | Horizontal maximum |

Broadcast

| Function | Signature | Description |
|----------|-----------|-------------|
| avx512_broadcast_f32 | F avx512_broadcast_f32(val: f64) -> i64 | Broadcast f32 to vector |
| avx512_broadcast_f64 | F avx512_broadcast_f64(val: f64) -> i64 | Broadcast f64 to vector |
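
A CPU-side sketch of a fused multiply-add loop over f32 arrays, 16 lanes at a time. It assumes the f32 data is addressed through *i64 pointers where each element covers 8 bytes, so one 512-bit chunk spans 8 i64 slots; the n % 16 remainder would need a scalar tail loop:

F fma_f32_avx512(a: *i64, b: *i64, c: *i64, n: i64) -> i64 {
    i := 0
    L i + 16 <= n {
        off := i / 2    # 16 f32 = 64 bytes = 8 i64 elements
        va := avx512_load_f32(&a[off])
        vb := avx512_load_f32(&b[off])
        vc := avx512_load_f32(&c[off])
        # c[i..i+16] = a * b + c
        avx512_store_f32(&c[off], avx512_fma_f32(va, vb, vc))
        i = i + 16
    }
    0
}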

AVX2 SIMD Operations (Intel/AMD)

Load/Store (256-bit vectors)

| Function | Signature | Description |
|----------|-----------|-------------|
| avx2_load_f32 | F avx2_load_f32(addr: *i64) -> i64 | Load 8 x f32 |
| avx2_store_f32 | F avx2_store_f32(addr: *i64, vec: i64) -> i64 | Store 8 x f32 |
| avx2_load_f64 | F avx2_load_f64(addr: *f64) -> i64 | Load 4 x f64 |
| avx2_store_f64 | F avx2_store_f64(addr: *f64, vec: i64) -> i64 | Store 4 x f64 |

Arithmetic

| Function | Signature | Description |
|----------|-----------|-------------|
| avx2_add_f32 | F avx2_add_f32(a: i64, b: i64) -> i64 | Vector add |
| avx2_sub_f32 | F avx2_sub_f32(a: i64, b: i64) -> i64 | Vector subtract |
| avx2_mul_f32 | F avx2_mul_f32(a: i64, b: i64) -> i64 | Vector multiply |
| avx2_fma_f32 | F avx2_fma_f32(a: i64, b: i64, c: i64) -> i64 | Vector FMA |

Broadcast

| Function | Signature | Description |
|----------|-----------|-------------|
| avx2_broadcast_f32 | F avx2_broadcast_f32(val: f64) -> i64 | Broadcast f32 to vector |
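
A CPU-side sketch that scales an f32 array in place, 8 lanes at a time, under the same *i64 addressing assumption as the AVX-512 sketch above:

F scale_f32_avx2(x: *i64, n: i64, s: f64) -> i64 {
    vs := avx2_broadcast_f32(s)
    i := 0
    L i + 8 <= n {
        off := i / 2    # 8 f32 = 32 bytes = 4 i64 elements
        avx2_store_f32(&x[off], avx2_mul_f32(avx2_load_f32(&x[off]), vs))
        i = i + 8
    }
    0
}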

ARM NEON SIMD Operations

Load/Store (128-bit vectors)

| Function | Signature | Description |
|----------|-----------|-------------|
| neon_load_f32 | F neon_load_f32(addr: *i64) -> i64 | Load 4 x f32 |
| neon_store_f32 | F neon_store_f32(addr: *i64, vec: i64) -> i64 | Store 4 x f32 |
| neon_load_f64 | F neon_load_f64(addr: *f64) -> i64 | Load 2 x f64 |
| neon_store_f64 | F neon_store_f64(addr: *f64, vec: i64) -> i64 | Store 2 x f64 |

Arithmetic

| Function | Signature | Description |
|----------|-----------|-------------|
| neon_add_f32 | F neon_add_f32(a: i64, b: i64) -> i64 | Vector add |
| neon_sub_f32 | F neon_sub_f32(a: i64, b: i64) -> i64 | Vector subtract |
| neon_mul_f32 | F neon_mul_f32(a: i64, b: i64) -> i64 | Vector multiply |
| neon_fma_f32 | F neon_fma_f32(a: i64, b: i64, c: i64) -> i64 | Vector FMA |

Reduction

| Function | Signature | Description |
|----------|-----------|-------------|
| neon_reduce_add_f32 | F neon_reduce_add_f32(vec: i64) -> f64 | Horizontal sum |
| neon_reduce_min_f32 | F neon_reduce_min_f32(vec: i64) -> f64 | Horizontal minimum |
| neon_reduce_max_f32 | F neon_reduce_max_f32(vec: i64) -> f64 | Horizontal maximum |

Broadcast

| Function | Signature | Description |
|----------|-----------|-------------|
| neon_dup_f32 | F neon_dup_f32(val: f64) -> i64 | Duplicate f32 to vector |
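
A CPU-side dot-product sketch, 4 lanes at a time, again under the *i64 addressing assumption used in the SIMD sketches above:

F dot_f32_neon(a: *i64, b: *i64, n: i64) -> f64 {
    total := 0.0
    i := 0
    L i + 4 <= n {
        off := i / 2    # 4 f32 = 16 bytes = 2 i64 elements
        prod := neon_mul_f32(neon_load_f32(&a[off]), neon_load_f32(&b[off]))
        total = total + neon_reduce_add_f32(prod)
        i = i + 4
    }
    total
}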

Usage

Basic Vector Addition Kernel

U std/gpu

#[gpu]
F vector_add(a: *f64, b: *f64, c: *f64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        c[idx] = a[idx] + b[idx]
    }
    0
}

F main() -> i64 {
    n := 1000000
    size := n * 8  # 8 bytes per f64

    # Allocate device memory
    d_a := gpu_alloc(size)
    d_b := gpu_alloc(size)
    d_c := gpu_alloc(size)

    # Copy input data (host_a and host_b are host buffers; their allocation is not shown)
    gpu_memcpy_h2d(d_a, host_a, size)
    gpu_memcpy_h2d(d_b, host_b, size)

    # Launch kernel
    block_size := 256
    grid_size := calc_blocks(n, block_size)
    gpu_launch_kernel(vector_add, grid_size, 1, 1, block_size, 1, 1, 0, [d_a, d_b, d_c, n], 4)

    # Synchronize, then copy results back
    gpu_synchronize()
    gpu_memcpy_d2h(host_c, d_c, size)

    # Cleanup
    gpu_free(d_a)
    gpu_free(d_b)
    gpu_free(d_c)

    0
}

Matrix Multiplication (2D Grid)

U std/gpu

#[gpu]
F matmul(A: *f64, B: *f64, C: *f64, N: i64) -> i64 {
    row := global_idx_y()
    col := global_idx_x()

    I row < N && col < N {
        sum := 0.0
        k := 0
        L k < N {
            sum = sum + A[row * N + k] * B[k * N + col]
            k = k + 1
        }
        C[row * N + col] = sum
    }
    0
}

F main() -> i64 {
    N := 1024
    config := kernel_config_2d(N, N, 16, 16)

    # Launch 2D kernel (d_A, d_B, d_C are device buffers prepared as in the previous example)
    gpu_launch_kernel(matmul, config.grid_x, config.grid_y, 1,
                      config.block_x, config.block_y, 1,
                      0, [d_A, d_B, d_C, N], 4)

    gpu_synchronize()
    0
}

Using Shared Memory

U std/gpu

#[gpu]
F reduce_sum(input: *f64, output: *f64, n: i64) -> i64 {
    tid := thread_idx_x()
    idx := global_idx()

    # Allocate shared memory
    shared := shared_alloc(256 * 8) as *f64

    # Load into shared memory
    I idx < n {
        shared[tid] = input[idx]
    } ! {
        shared[tid] = 0.0
    }

    sync_threads()

    # Reduction in shared memory (assumes 256 threads per block)
    stride := 128
    L stride > 0 {
        I tid < stride {
            shared[tid] = shared[tid] + shared[tid + stride]
        }
        sync_threads()
        stride = stride / 2
    }

    # Write result
    I tid == 0 {
        output[block_idx_x()] = shared[0]
    }

    0
}

Atomic Operations

U std/gpu

#[gpu]
F histogram(data: *i64, bins: *i64, n: i64, num_bins: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        bin := data[idx] % num_bins
        atomic_add(&bins[bin], 1)
    }
    0
}

Warp-Level Reduction

U std/gpu

#[gpu]
F warp_reduce(input: *f64, output: *f64, n: i64) -> i64 {
    idx := global_idx()
    val := I idx < n { input[idx] } ! { 0.0 }

    # Warp-level shuffle reduction (assumes 32-wide warps)
    val = val + warp_shuffle_down(val, 16)
    val = val + warp_shuffle_down(val, 8)
    val = val + warp_shuffle_down(val, 4)
    val = val + warp_shuffle_down(val, 2)
    val = val + warp_shuffle_down(val, 1)

    # First lane of each warp writes its partial result
    I lane_id() == 0 {
        output[block_idx_x() * (block_dim_x() / 32) + thread_idx_x() / 32] = val
    }

    0
}

Stream-Based Async Execution

U std/gpu

F main() -> i64 {
    # Create streams
    stream1 := gpu_stream_create()
    stream2 := gpu_stream_create()

    # Async copies (device/host buffers and size are assumed to be set up as in earlier examples)
    gpu_memcpy_h2d_async(d_a1, h_a1, size, stream1)
    gpu_memcpy_h2d_async(d_a2, h_a2, size, stream2)

    # Launch on different streams (arguments abbreviated; see gpu_launch_kernel for the full parameter list)
    gpu_launch_kernel_stream(kernel1, grid, block, stream1, args1)
    gpu_launch_kernel_stream(kernel2, grid, block, stream2, args2)

    # Async copy results
    gpu_memcpy_d2h_async(h_c1, d_c1, size, stream1)
    gpu_memcpy_d2h_async(h_c2, d_c2, size, stream2)

    # Synchronize streams
    gpu_stream_synchronize(stream1)
    gpu_stream_synchronize(stream2)

    # Cleanup
    gpu_stream_destroy(stream1)
    gpu_stream_destroy(stream2)

    0
}

GPU Timing with Events

U std/gpu

F main() -> i64 {
    start := gpu_event_create()
    stop := gpu_event_create()

    gpu_event_record(start)

    # Launch kernel (1D launch; my_kernel, grid, block, args, and arg_count as in earlier examples)
    gpu_launch_kernel(my_kernel, grid, 1, 1, block, 1, 1, 0, args, arg_count)

    gpu_event_record(stop)
    gpu_event_synchronize(stop)

    elapsed_ms := gpu_event_elapsed(start, stop)

    gpu_event_destroy(start)
    gpu_event_destroy(stop)

    0
}

Notes

  • Kernel-side functions (thread_idx_*, atomic_*, etc.) are replaced by the GPU codegen backend; their host-side placeholders return dummy values.
  • Host-side functions (gpu_alloc, gpu_launch_kernel, etc.) are extern C functions linked from gpu_runtime.c.
  • Compile with --gpu cuda for NVIDIA, --gpu metal for Apple, or --gpu opencl for cross-platform.
  • Memory pointers returned by gpu_alloc are device pointers and cannot be dereferenced on the host.
  • Always call gpu_synchronize() before reading results back to the host.
  • SIMD functions (AVX-512, AVX2, NEON) are CPU-side optimizations, not GPU kernels.