Vais GPU Code Generation

Vais can compile marked kernel functions to GPU compute code for three targets: CUDA, OpenCL, and WebGPU (WGSL).

Quick Start

# Compile to CUDA
vaisc build kernel.vais --gpu cuda

# Compile to OpenCL
vaisc build kernel.vais --gpu opencl

# Compile to WebGPU WGSL
vaisc build kernel.vais --gpu webgpu

Writing GPU Kernels

Use the #[gpu] or #[kernel] attribute to mark functions as GPU kernels:

U std/gpu

# GPU kernel for vector addition
#[gpu]
F vector_add(a: *f64, b: *f64, c: *f64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        c[idx] = a[idx] + b[idx]
    }
    0
}

Supported Targets

CUDA (.cu)

For NVIDIA GPUs. Generates CUDA C code that can be compiled with nvcc.

vaisc build kernel.vais --gpu cuda -o kernel.cu
nvcc -c kernel.cu -o kernel.o

OpenCL (.cl)

Cross-platform GPU code. Works with NVIDIA, AMD, and Intel GPUs.

vaisc build kernel.vais --gpu opencl -o kernel.cl

WebGPU WGSL (.wgsl)

For browser-based GPU computing. Generates WGSL shaders.

vaisc build kernel.vais --gpu webgpu -o kernel.wgsl

GPU Built-in Functions

Thread Indexing

Function            Description
thread_idx_x()      Thread index within block (x)
thread_idx_y()      Thread index within block (y)
thread_idx_z()      Thread index within block (z)
block_idx_x()       Block index within grid (x)
block_idx_y()       Block index within grid (y)
block_idx_z()       Block index within grid (z)
block_dim_x()       Block dimension (x)
block_dim_y()       Block dimension (y)
block_dim_z()       Block dimension (z)
global_idx()        Global linear thread index
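
As the generated CUDA later in this page shows, global_idx() expands to threadIdx.x + blockIdx.x * blockDim.x; the same index can be built by hand from the block built-ins. A minimal sketch (the kernel name scale and its parameters are illustrative):

U std/gpu

#[gpu]
F scale(data: *f64, factor: f64, n: i64) -> i64 {
    # Manual 1D index; equivalent to calling global_idx()
    idx := block_idx_x() * block_dim_x() + thread_idx_x()
    I idx < n {
        data[idx] = data[idx] * factor
    }
    0
}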

Synchronization

Function            Description
sync_threads()      Synchronize all threads in a block
thread_fence()      Memory fence (global)
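
Call sync_threads() whenever threads read shared memory written by other threads in the same block, as in the reduction example below. A minimal sketch that reverses each block's slice of an array in place (the kernel name is illustrative, and the array length is assumed to be a multiple of the 256-thread block size):

U std/gpu

#[gpu]
F reverse_block(data: *f64) -> i64 {
    # 256 doubles of shared memory (8 bytes each); assumes 256-thread blocks
    shared := shared_alloc(256 * 8)
    tid := thread_idx_x()
    shared[tid] = data[global_idx()]
    # Barrier: every write above must be visible before any thread reads below
    sync_threads()
    data[global_idx()] = shared[255 - tid]
    0
}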

Atomic Operations

Function                      Description
atomic_add(addr, val)         Atomic add
atomic_sub(addr, val)         Atomic subtract
atomic_min(addr, val)         Atomic minimum
atomic_max(addr, val)         Atomic maximum
atomic_cas(addr, cmp, val)    Compare-and-swap
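
Each atomic takes a pointer as its first argument, as in the reduction example below. A minimal sketch that counts elements greater than zero (the kernel name and parameters are illustrative):

U std/gpu

#[gpu]
F count_positive(data: *f64, count: *i64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n && data[idx] > 0.0 {
        # Many threads increment concurrently; atomic_add keeps the total exact
        atomic_add(count, 1)
    }
    0
}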

Math Functions

Standard math functions (sqrt, sin, cos, exp, log, pow, etc.) are mapped to their GPU equivalents.
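
A minimal sketch that calls sqrt from a kernel (the kernel name and parameters are illustrative):

U std/gpu

#[gpu]
F magnitude(x: *f64, y: *f64, out: *f64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        # sqrt is translated to the target's native sqrt
        out[idx] = sqrt(x[idx] * x[idx] + y[idx] * y[idx])
    }
    0
}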

Examples

Vector Addition

U std/gpu

#[gpu]
F vector_add(a: *f64, b: *f64, c: *f64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        c[idx] = a[idx] + b[idx]
    }
    0
}

Generated CUDA:

__global__ void vector_add(double* a, double* b, double* c, long long n) {
    long long idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Matrix Multiplication

U std/gpu

#[gpu]
F matmul(A: *f64, B: *f64, C: *f64, N: i64) -> i64 {
    row := block_idx_y() * block_dim_y() + thread_idx_y()
    col := block_idx_x() * block_dim_x() + thread_idx_x()

    I row < N && col < N {
        sum := 0.0
        k := 0
        L k < N {
            sum = sum + A[row * N + k] * B[k * N + col]
            k = k + 1
        }
        C[row * N + col] = sum
    }
    0
}

Reduction (Sum)

U std/gpu

#[gpu]
F reduce_sum(data: *f64, result: *f64, n: i64) -> i64 {
    # Shared memory for partial sums: 256 doubles (8 bytes each) per block
    shared := shared_alloc(256 * 8)

    tid := thread_idx_x()
    idx := global_idx()

    # Load data
    I idx < n {
        shared[tid] = data[idx]
    } E {
        shared[tid] = 0.0
    }

    sync_threads()

    # Tree reduction in shared memory (assumes 256 threads per block)
    s := 128
    L s > 0 {
        I tid < s {
            shared[tid] = shared[tid] + shared[tid + s]
        }
        sync_threads()
        s = s / 2
    }

    # Write result
    I tid == 0 {
        atomic_add(result, shared[0])
    }
    0
}

Limitations

  1. No closures: GPU kernels cannot use closures or captures
  2. Limited types: Only primitive types and pointers supported
  3. No recursion: GPU kernels cannot be recursive
  4. No dynamic allocation: Use shared_alloc() for shared memory
  5. Fixed function signatures: Kernel parameters must be pointers or primitives

Type Mapping

Vais Type   CUDA         OpenCL        WGSL
i64         long long    long          i32*
i32         int          int           i32
f64         double       double        f32*
f32         float        float         f32
bool        bool         bool          bool
*T          T*           __global T*   ptr<storage, T>

* WebGPU has limited 64-bit support, so i64 and f64 are narrowed to 32-bit types.

Performance Tips

  1. Coalesced memory access: Access memory in a stride-1 pattern (see the sketch after this list)
  2. Shared memory: Use shared_alloc() for frequently accessed data
  3. Avoid divergence: Minimize branching within warps
  4. Occupancy: Choose block sizes that maximize GPU utilization
  5. Memory transfers: Minimize host-device data transfers
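
To illustrate tip 1: consecutive threads should touch consecutive addresses. In the hypothetical pair of copy kernels below, the first is coalesced; the second skips through memory and will typically run far slower for stride > 1:

U std/gpu

# Coalesced: thread i reads and writes element i (stride-1 across the block)
#[gpu]
F copy_coalesced(src: *f64, dst: *f64, n: i64) -> i64 {
    idx := global_idx()
    I idx < n {
        dst[idx] = src[idx]
    }
    0
}

# Strided: thread i touches element i * stride (uncoalesced for stride > 1)
#[gpu]
F copy_strided(src: *f64, dst: *f64, n: i64, stride: i64) -> i64 {
    idx := global_idx() * stride
    I idx < n {
        dst[idx] = src[idx]
    }
    0
}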

Future Enhancements

  • Automatic memory transfer code generation
  • Texture and surface memory support
  • CUDA cooperative groups
  • Multi-GPU support
  • Performance profiling integration