Optimizing Hierarchical Algorithms for GPGPUs
The performance potential of future architectures, thanks to Moore's Law, grows linearly with the number of available devices per integrated circuit. Whether these future devices are ultra-small CMOS transistors, nanotubes, or even individual molecules, it is clearly understood that there will be many of them available to computer architects. If current architecture trends are a good indicator of future designs, many of these devices will likely be allocated as extra cores on chip-multicore systems. However, highly parallel processors built from ultra-small devices bring some inherent difficulties. Between the complexity of programming multiprocessor systems, increased power consumption, and higher fault rates for these tiny devices, future architects will have their work cut out for them. However, recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. In this paper we describe a GPGPU-accelerated extension to an intelligent model based on the mammalian neocortex. Our cortical architecture, like the human brain, exhibits massive amounts of processing parallelism, making today's GPGPUs a highly attractive and readily available hardware accelerator for such a model. Using NVIDIA's CUDA framework, we have achieved up to 330x speedup over an unoptimized serial C++ implementation. We also consider two inefficiencies inherent to our initial design: the overhead of multiple kernel launches and poor utilization of GPGPU resources. We propose using a software work-queue structure to solve the former, and pipelining the cortical architecture during the training phase to solve the latter. We also investigate applying these techniques to a few CUDA applications that exhibit some structural similarities to our cortical architecture model.
Additionally, from our success in extending our model to the GPU, we estimate the hardware requirements for simulating the computational abilities of mammalian brains.