We aim to implement a parallel cloth simulation using a Mass-Spring model. The simulation will run efficiently on GPUs using CUDA to accelerate computations. We started from the open-source library cuda_clothes, implemented basic self-collision detection, introduced a grid-based broad-phase collision detection method, and an iterative refinement strategy to enhance robustness.
In this project, we plan to perform a cloth simulation on an irregular 3D surface, which may also include collision simulations that require dynamic task allocation techniques. The simulation involves not only handling the deformation and forces acting on the cloth but also collision detection and response with surrounding objects. During the simulation, the system needs to update forces (such as gravity, elasticity, and air resistance) on every vertex of the cloth in real-time or near real-time, as well as detect and handle collisions between cloth vertices or triangles and external objects.
Because the cloth is typically composed of a large number of mesh vertices and triangular faces, and every vertex may be subject to distinct constraints and collisions when placed on an irregular surface, the computational workload can be substantial. Moreover, collision detection often requires repeated, intensive spatial queries for different regions or objects, with computational cost growing rapidly as the cloth mesh or scene complexity increases. To address this challenge, we plan to split the cloth simulation into multiple smaller tasks that can be processed in parallel, such as partitioning the cloth into sub-regions for force and collision computations, and dynamically merging or splitting these sub-tasks depending on the simulation load. This approach will make full use of multi-core or multi-threaded architectures to increase overall simulation performance.
We have updated the workflow to include synchronization, spatial grid construction, and iterative collision detection refinement:
initialize cloth mesh
while simulation running:
    update vertex positions (synchronized)
    construct/update spatial grid
    repeat until stable:
        detect collisions using spatial grid
        resolve collisions
    synchronize updated mesh state
render simulation results
        By partitioning the cloth mesh into smaller sub-regions and processing them in parallel, each thread or computational unit only needs to handle a limited set of vertices or triangles, thus reducing the computational cost and memory access overhead for collision detection and other tasks. Furthermore, as the cloth moves or collision demands change, we can dynamically adjust the partitioning and scheduling strategy to prevent any thread from being either underutilized or overloaded, and thereby further optimize overall performance.
One of the main challenges of this cloth simulation project is the inherently irregular and dynamic nature of the problem. Each cloth vertex’s movement and collisions depend not only on its immediate neighbors for forces such as tension and bending but also on the state of the environment, which can vary widely across the surface. This leads to complex dependencies where results in one region of the cloth can impact calculations in another region. Additionally, frequent and unpredictable collision checks introduce irregular memory access patterns: data may need to be fetched from disparate memory locations, complicating caching and reducing spatial and temporal locality. As a result, the communication-to-computation ratio can become high if each sub-region must constantly exchange boundary information or resolve interactions with other sub-regions.
These factors make it difficult to parallelize the workload in a straightforward manner. Some parts of the cloth may experience heavy collisions or intricate deformations (thus requiring extensive computation), while others remain relatively stable—leading to divergent execution. Balancing this load evenly across multiple threads or computational units becomes a non-trivial task. Moreover, dynamically reallocating sub-regions or tasks at runtime to accommodate changing collision patterns introduces additional overhead and synchronization demands.
Mapping the workload to a parallel system is further constrained by the need to maintain a globally consistent state. Each time step must correctly account for updates in all sub-regions, and any collisions across boundaries of these sub-regions must be handled without race conditions or data inconsistencies. On hardware with limited shared memory or certain GPU architectures, the irregular access patterns and sudden spikes in computational requirements can cause bottlenecks and underutilized resources. Consequently, one of the core lessons to be learned in this project is how to design dynamic, load-balanced partitioning strategies and synchronization mechanisms that accommodate frequent changes in the simulation state while still achieving efficient parallelization.
Additional concerns include collision tunneling and balancing accuracy with efficiency.<\p>
We plan to begin with a simple parallel cloth simulation code base—likely an open-source project that already supports basic multi-threading—so that we can focus our efforts on implementing and evaluating our own dynamic partitioning and task allocation strategies. We will run and test our code on standard multi-core CPUs and potentially on a GPU-equipped workstation if it proves beneficial for accelerating certain aspects of the simulation. As a reference, we will draw on established literature in cloth simulation, such as Baraff and Witkin’s seminal paper “Large Steps in Cloth Simulation” (Baraff, D., & Witkin, A. (1998). Large steps in cloth simulation. ACM SIGGRAPH), to inform our approaches to collision handling and numerical integration. Beyond these resources, we do not currently anticipate needing access to specialized hardware beyond high-core-count machines or general-purpose GPUs, though we remain open to exploring more advanced HPC clusters if our simulation’s requirements grow.
These are the core objectives that we must accomplish to consider our project successful and meet our expected grading criteria:
If the project progresses ahead of schedule, we aim to explore additional enhancements:
If feasible, we plan to showcase a live demo during the poster session:
We have chosen a standard multi-core CPU platform (with optional GPU acceleration) and a C++-based simulation framework because this setup aligns well with our parallelization goals and performance requirements. (Possibly GHC machine or our own computer)
C++ offers efficient low-level control over data structures and memory management, which is critical for irregular workloads like cloth simulation. Multi-core CPUs (and GPUs, if available) provide the parallel computing capabilities we need to divide and distribute the simulation tasks across multiple threads or compute units. Additionally, existing libraries and parallel frameworks (such as OpenMP, TBB, or CUDA) allow us to quickly prototype and optimize various load balancing and synchronization strategies. This combination makes it feasible to implement our dynamic partitioning approach, manage unpredictable collision workloads, and still maintain high performance and scalability.