Designing Efficient Barriers and Semaphores for Graphics Processing Units
General-purpose GPU applications that use fine-grained synchronization to enforce ordering between many threads accessing shared data have become increasingly popular. Thus, it is imperative to create more efficient GPU synchronization primitives for these applications. Accordingly, in recent years there has been a push to establish a single, unified set of GPU synchronization primitives. However, unlike CPUs, modern GPUs support synchronization primitives poorly. In particular, inefficient support for atomics, which are used to implement fine-grained synchronization, makes it challenging to implement efficient algorithms. Therefore, as GPU algorithms are scaled to millions or billions of threads, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of increased contention between threads accessing the same shared synchronization objects. In this work, we seek to overcome these inefficiencies by designing more efficient, scalable GPU global barriers and semaphores. In particular, we show how multi-level sense-reversing barriers and priority mechanisms for semaphores can be extended from prior CPU implementations and applied to the GPU's unique processing model to improve the performance and scalability of GPU synchronization primitives. Our results show that the proposed designs significantly improve performance compared to state-of-the-art solutions such as CUDA Cooperative Groups, and scale to an order of magnitude more threads while avoiding the livelock that prior open-source algorithms suffer as they scale.
Overall, across three modern GPUs, the proposed barrier implementation reduces atomic traffic by 50%, improves performance by an average of 26% over a GPU tree barrier algorithm, and improves performance by an average of 30% over CUDA Cooperative Groups on four full-sized benchmarks; the new semaphore implementation improves performance by an average of 65% compared to prior GPU semaphore implementations.