The FAI Summit research track brings together entrepreneurs, researchers, and representatives from the world's leading organizations to foster a community of Flexible, Accessible, Sustainable, Transformable AI (FAST AI). The research track features 12 talks on the latest AI technologies. The track also includes a panel discussion and a virtual networking event where attendees can engage in meaningful conversations about accelerating project development by leveraging the infrastructure of the FAST AI community.
Keynote: SHREC is an NSF Center for Space, High-Performance, and Resilient Computing. In this talk, we will give an overview of the SHREC Center, its composition, and its mission, and present four active projects at the University of Florida site of SHREC under the umbrella of Heterogeneous Computing for Data Science: (1) Compute Cache Hierarchy, focusing on compute-near-memory technologies such as FPGAs and compute-in-memory technologies such as PIM (Processing-in-Memory) and IPU (Intelligent Processing Unit) devices; (2) Heterogeneous Point Cloud Net (HgPCN), a heterogeneous architecture for embedded 3-D point-cloud inference that aims to satisfy the stringent real-time requirements of applications at the computing edge; (3) Productive Computational Science (PCS) Platform, which provides an accelerator- and system-agnostic programming abstraction focused on scalability and productivity to meet the demands of rapidly changing AI workloads and heterogeneous architectures; and (4) End-to-End ML Pipeline, leveraging Intel's AI Analytics Toolkit to develop an end-to-end ML pipeline using oneAPI (a minimal sketch follows below).
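As an illustration of the kind of drop-in oneAPI acceleration the fourth project draws on, the sketch below patches scikit-learn with Intel's Extension for Scikit-learn (part of the AI Analytics Toolkit). The dataset and model are placeholders chosen for illustration, not details from the keynote.

```python
# Minimal sketch: accelerating a scikit-learn pipeline with Intel's
# Extension for Scikit-learn (part of the AI Analytics Toolkit / oneAPI).
# The dataset and model below are placeholders, not taken from the keynote.
from sklearnex import patch_sklearn
patch_sklearn()  # re-routes supported estimators to oneAPI-optimized kernels

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=10_000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```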
Bio: Herman Lam is an Associate Professor of Electrical and Computer Engineering at the University of Florida. His main research interests are heterogeneous computing (HGC) and reconfigurable computing (RC), focusing on methods and tools for the acceleration and deployment of scientifically impactful applications on scalable RC and HGC systems. He was a Co-PI on the project recognized by the 2012 Alexander Schwarzkopf Prize for Technology Innovation from the National Science Foundation, "Novo-G: An innovative and synergistic research project and the world's most powerful reconfigurable supercomputer". Dr. Lam has authored or co-authored over 175 refereed conference and journal articles and one textbook. He served as the Associate Director of CHREC, the NSF Center for High-Performance Reconfigurable Computing, and is currently the University of Florida Site Director of the NSF Center for Space, High-Performance, and Resilient Computing. Academically, Dr. Lam was the Director of the Computer Engineering undergraduate program in the College of Engineering at the University of Florida from 2012 to 2021.
Contact information: http://www.hlam.ece.ufl.edu/ hlam@ufl.edu
Abstract: By providing highly efficient one-sided communication in a globally shared memory space, Partitioned Global Address Space (PGAS) has become one of the most promising parallel programming models in high-performance computing (HPC). Meanwhile, the FPGA is gaining attention as an alternative compute platform for HPC systems, offering custom computing and design flexibility. However, unlike the traditional message passing interface, PGAS has not yet been explored on FPGAs. This paper proposes FSHMEM, a software/hardware framework that enables the PGAS programming model on FPGAs. We implement the core functions of the GASNet specification on the FPGA for native PGAS integration in hardware, while the programming interface is designed to be highly compatible with legacy software. Our experiments show that FSHMEM achieves a peak bandwidth of 3813 MB/s, more than 95% of the theoretical maximum, outperforming prior work by 9.5×. It records 0.35 µs and 0.59 µs latencies for remote write and read operations, respectively. Finally, we conduct a case study on two Intel D5005 FPGA nodes integrating Intel's deep learning accelerator. The two-node system programmed with FSHMEM achieves 1.94× and 1.98× speedups for matrix multiplication and convolution operations, respectively, showing its scalability potential for HPC infrastructure.
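For readers unfamiliar with PGAS semantics, the sketch below shows an analogous one-sided remote write using MPI remote memory access through mpi4py. It only illustrates the put/get programming model that a GASNet-style interface exposes; it is not FSHMEM's actual API.

```python
# One-sided (PGAS-style) remote write illustrated with MPI RMA via mpi4py.
# This is an analogy for GASNet/FSHMEM-style put/get semantics, not FSHMEM's
# API. Run with: mpiexec -n 2 python pgas_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4, dtype="d")          # local memory exposed to remote peers
win = MPI.Win.Create(buf, comm=comm)  # the "globally addressable" window

win.Fence()                           # open an access epoch
if rank == 0:
    payload = np.arange(4, dtype="d")
    win.Put(payload, 1)               # remote write into rank 1, no recv needed
win.Fence()                           # close epoch; data is now visible

if rank == 1:
    print("rank 1 sees:", buf)        # -> [0. 1. 2. 3.]
win.Free()
```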
Bio: Yashael Faith Arthanto is from Indonesia. He received a BS degree from the Bandung Institute of Technology, Indonesia, in 2019 and an MS degree in EEE from KAIST, South Korea, in 2022. He now works for Rebellions, an AI chip startup in South Korea.
Contact Information: yashael.faith@alumni.kaist.ac.kr
Abstract: Graph neural networks (GNNs) are becoming increasingly important in many applications such as social science, natural science, and autonomous driving. Driven by real-time inference requirements, GNN acceleration has become a key research topic. Given the wide diversity of GNN model types, such as graph convolutional networks, graph attention networks, and graph isomorphism networks, with arbitrary aggregation methods and edge attributes, designing a generic GNN accelerator is challenging. In this talk, we discuss our proposed generic and efficient GNN accelerator, called FlowGNN, which can easily accommodate a wide range of GNN types. Without losing generality, FlowGNN can outperform CPUs and GPUs by up to 400 times. In addition, we discuss an open-source automation flow, GNNBuilder, which allows users to design their own GNNs in PyTorch and then automatically generates accelerator code targeting FPGAs.
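To give a concrete sense of the kind of input such an automation flow consumes, below is a minimal GNN written with PyTorch Geometric. The model is an assumed example for illustration, not one taken from the FlowGNN or GNNBuilder work.

```python
# Minimal example of the kind of PyTorch GNN a flow like GNNBuilder would
# take as input (assumed model, not taken from the talk).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class SimpleGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.lin = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))   # message passing + aggregation
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)          # graph-level readout
        return self.lin(x)

# Toy graph: 4 nodes with 3 features each and a few directed edges.
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
batch = torch.zeros(4, dtype=torch.long)        # all nodes belong to one graph
print(SimpleGCN(3, 16, 2)(x, edge_index, batch).shape)  # torch.Size([1, 2])
```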
Bio: Dr. Cong (Callie) Hao is an assistant professor in ECE at Georgia Tech. She was a postdoctoral fellow at Georgia Tech from 2020 to 2021 and at UIUC from 2018 to 2020. She received her Ph.D. degree in Electrical Engineering from Waseda University in 2017, and her M.S. and B.S. degrees in Computer Science and Engineering from Shanghai Jiao Tong University. Her primary research interests lie in the joint area of efficient hardware design and machine learning algorithms, including software/hardware co-design for reconfigurable and high-efficiency computing and agile electronic design automation tools.
Contact information: https://sharclab.ece.gatech.edu/ callie.hao@ece.gatech.edu
Abstract: Hardware accelerators can help data scientists and ML engineers run their applications much faster, but deploying these accelerators has been quite challenging until now. In this talk, we will show how ML developers can utilize the power of hardware accelerators such as FPGAs with zero code changes. FPGAs are adaptable hardware platforms that can offer high performance, low latency, and reduced OpEx for applications like machine learning. We will show how users can enjoy the performance of hardware accelerators while keeping the same ease of deployment as any other computing platform.
Bio: Christoforos Kachris is the co-founder and CEO of InAccel, which helps companies speed up their AI/ML applications using hardware accelerators (FPGAs) in the cloud or on-prem. Christoforos holds a Ph.D. in Computer Engineering from Delft University of Technology and has more than 20 years of experience in hardware acceleration. He is the editor of "Hardware Accelerators in Data Centers" and co-author of more than 80 peer-reviewed scientific publications on FPGA-based hardware acceleration (with more than 2400 citations). He supervised three winning teams in the international Open Hardware contest for contributions to ML acceleration in 2018 and 2020.
Contact Information: https://inaccel.com/ chris@inaccel.com
Abstract: Many large-scale physics experiments, such as ATLAS at the Large Hadron Collider, the Deep Underground Neutrino Experiment, and sPHENIX at the Relativistic Heavy Ion Collider, rely on accurate simulations to inform data analysis and derive scientific results. However, the inevitable discrepancy between simulation and experiment requires heuristic corrections in a conventional analysis workflow. It also prevents data-driven models, learned on simulation data, from inferring on experiment data directly. Our goal is to develop machine learning methods that can bridge the gap between simulations and experiments. Our initial effort demonstrated the feasibility of such an approach using a Vision Transformer-augmented U-Net under the CycleGAN framework. In this talk, I will present our model (UVCGAN) and its application to two tiers of data from Liquid Argon Time Projection Chamber simulations. UVCGAN is also competitive against other advanced image translation models on open benchmark data sets.
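Because UVCGAN builds on the CycleGAN framework, the unpaired translation hinges on a cycle-consistency objective. The sketch below shows that objective with tiny placeholder generators standing in for the Vision Transformer-augmented U-Nets; it is a generic illustration of the framework, not the UVCGAN code.

```python
# Sketch of the cycle-consistency idea behind CycleGAN-style unpaired
# translation (the framework UVCGAN builds on). The tiny conv generators
# below are placeholders for the ViT-augmented U-Nets used in the talk.
import torch
import torch.nn as nn

def tiny_generator():
    # stand-in generator: preserves spatial size of a 1-channel detector image
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

G_sim2exp = tiny_generator()   # simulation -> experiment domain
G_exp2sim = tiny_generator()   # experiment -> simulation domain
l1 = nn.L1Loss()

sim_batch = torch.randn(8, 1, 64, 64)   # unpaired simulated images
exp_batch = torch.randn(8, 1, 64, 64)   # unpaired experimental images

# Cycle consistency: translating to the other domain and back should
# recover the original image, even though no paired examples exist.
cycle_loss = (l1(G_exp2sim(G_sim2exp(sim_batch)), sim_batch) +
              l1(G_sim2exp(G_exp2sim(exp_batch)), exp_batch))
print(float(cycle_loss))   # combined with adversarial losses during training
```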
Bio: Yihui, a.k.a. "Ray", works in the general area of Artificial Intelligence (AI), its applications in science and its interaction with novel hardware. Ray's current research topics include unpaired image translation to bridge the gap between simulation and experiments, neural network optimization and deployment for real-time systems, novel hardware exploration and benchmarking, privacy-preserving AI, and bringing advanced AI methods to scientific domains.
Contact Information: https://www.bnl.gov/staff/yren yren@bnl.gov
Abstract: Large Language Models are shifting "what's possible" in AI, but distributed training across thousands of traditional accelerators is massively complex and always suffers diminishing returns as more compute is added. Always? No longer. In this talk, I will give an overview of the Cerebras Wafer-Scale Cluster, which required a fundamental redesign of chips, systems, compilers, workflow scaling, and beyond. I will present a cluster of 16 Cerebras CS-2 nodes with nearly 13 million AI cores in total, more cores than the world's most powerful supercomputer, that achieves near-perfect linear scaling.
Bio: Prashanth holds a Ph.D. in Computer Science and Engineering from Penn State, where his research focused on systems aspects of high-performance and cloud computing. He has authored several conference papers and a book chapter in the area. He currently works at Cerebras Systems as an AI Cluster Infrastructure Engineer, developing the systems that enable large-scale AI model training on Cerebras's Wafer-Scale Cluster. This system recently made news for training the largest AI model ever trained on a single device and won the ACM Gordon Bell Special Prize for HPC-based COVID research. News: https://www.cerebras.net/company/news/
Contact information: prashanth.thina@gmail.com
Abstract: From edge to AI and HPC, computer architectures are becoming more heterogeneous and complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture, resulting in poor portability and productivity. This talk argues that a more agile, proactive, and intelligent runtime system is essential to increase performance portability and improve user productivity. To this end, this talk introduces a new runtime system called IRIS. IRIS enables programmers to write portable and flexible programs across diverse heterogeneous architectures for different application domains from embedded/mobile computing to AI and HPC computing, by orchestrating multiple programming platforms in a single execution and programming environment.
Bio: Seyong Lee is a Senior R&D Staff Member in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. His research interests include parallel programming and performance optimization in heterogeneous computing environments, program analysis, and optimizing compilers. He received his PhD in Electrical and Computer Engineering from Purdue University, USA. He is a member of the OpenACC Technical Committee and a former member of the Exascale Computing Project PathForward Working Group. He has served as a program committee member, guest editor, and external reviewer for various conferences, journals, and proposals. His SC10 paper won the best student paper award, and his PPoPP09 paper was selected as the most cited paper among all papers published in PPoPP between 2009 and 2014. He received the IEEE Computer Society TCHPC Award for Excellence for Early Career Researchers in High Performance Computing at SC16 and served as an award committee member for the 2017 IEEE CS TCHPC Award.
Contact information: lees2@ornl.gov http://ft.ornl.gov/~lees2/
Abstract: Deep learning technology has made significant progress on various cognitive tasks once believed impossible for computers to do as well as humans, including image classification, object detection, speech recognition, and natural language processing. However, the widespread adoption of deep learning has also highlighted its shortcomings, such as limited generalizability and lack of interpretability. In addition, application-specific deep learning models require large numbers of manually annotated training samples and sophisticated learning schemes. With the performance of early models such as MLPs, CNNs, and RNNs saturating, one notable recent innovation in deep learning architecture is the transformer model, introduced in 2017. It has two properties that favor it over conventional models on the path toward artificial general intelligence. First, the performance of transformer models continues to grow with model size and training data. Second, transformers can be pre-trained on vast amounts of unlabeled data through unsupervised or self-supervised learning and can be fine-tuned quickly for each application. In this talk, I will present a multi-FPGA acceleration appliance named DFX for accelerating hyperscale transformer-based AI models. Optimized for OpenAI's GPT (Generative Pre-trained Transformer) models, it executes end-to-end inference with low latency and high throughput. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution across multiple devices. Its compute cores operate on custom instructions and support all GPT operations, including multi-head attention, layer normalization, token embedding, and the LM head. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. Finally, DFX achieves a 5.58× speedup and 3.99× higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21× more cost-effective than the GPU appliance, suggesting that it can be a promising alternative in cloud datacenters.
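For context, the end-to-end workload DFX targets is autoregressive GPT-2 generation, which is latency-bound because every new token requires a full pass through all transformer layers. The sketch below runs that baseline workload with the Hugging Face transformers library on a conventional CPU/GPU stack; it only illustrates the workload class, not DFX itself, and the prompt is an arbitrary example.

```python
# Baseline GPT-2 text-generation workload of the kind DFX accelerates,
# run here with the Hugging Face transformers library.
# (Illustrative only; DFX executes GPT models on FPGA hardware instead.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("Hardware acceleration for transformers", return_tensors="pt")
with torch.no_grad():
    # Generation is latency-bound: each new token requires a full pass
    # through every transformer layer (attention, layer norm, LM head, ...).
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```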
Bio: Joo-Young Kim received his B.S., M.S., and Ph.D. degrees in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 2005, 2007, and 2010, respectively. He is currently an Assistant Professor in the School of Electrical Engineering at KAIST and the Director of the AI Semiconductor Systems (AISS) research center. His research interests span various aspects of hardware design, including VLSI design, computer architecture, FPGAs, domain-specific accelerators, hardware/software co-design, and agile hardware development. Before joining KAIST, Joo-Young was a Senior Hardware Engineering Lead at Microsoft Azure, working on hardware acceleration for its hyperscale big data analytics platform, Azure Data Lake. Before that, he was one of the initial members of the Catapult project at Microsoft Research, where he deployed a fabric of FPGAs in datacenters to accelerate critical cloud services such as machine learning, data storage, and networking. Joo-Young is a recipient of the 2016 IEEE Micro Top Picks Award, the 2014 IEEE Micro Top Picks Award, the 2010 DAC/ISSCC Student Design Contest Award, the 2008 DAC/ISSCC Student Design Contest Award, and the 2006 A-SSCC Student Design Contest Award. He served as an Associate Editor for IEEE Transactions on Circuits and Systems I: Regular Papers (2020-2021).
Contact Information: https://castlab.kaist.ac.kr/our-team/joo-young-kim/ jooyoung1203@kaist.ac.kr
Abstract: This talk will go over different AutoML methods that share the common themes of efficiency and hardware awareness. We will present (1) a fast predictor-based search algorithm, (2) zero-cost NAS proxies that significantly speed up the evaluation phase of NAS, (3) accurate hardware latency prediction in NAS, (4) automated hardware-DNN co-design, and (5) real application case studies in which NAS delivered significant improvements. Our projects are all part of an effort to enable on-device deployment of DNNs on constrained hardware devices, an area in which AutoML can play a big role.
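To make item (2) concrete, the sketch below scores two untrained candidate networks with a simple gradient-norm proxy computed from a single minibatch. This is a generic zero-cost proxy shown for illustration, not necessarily one of the proxies presented in the talk.

```python
# Minimal sketch of a zero-cost NAS proxy: score an untrained candidate
# network from the gradient norm of one minibatch, with no training at all.
# Generic grad-norm proxy for illustration, not the talk's specific proxies.
import torch
import torch.nn as nn

def gradnorm_score(model, inputs, targets):
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    # Sum of parameter-gradient norms acts as a cheap trainability signal.
    return sum(p.grad.norm().item() for p in model.parameters()
               if p.grad is not None)

# Two candidate architectures from a toy search space.
cand_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                       nn.ReLU(), nn.Linear(64, 10))
cand_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256),
                       nn.ReLU(), nn.Linear(256, 10))

x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))
print("candidate A:", gradnorm_score(cand_a, x, y))
print("candidate B:", gradnorm_score(cand_b, x, y))
```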
Bio: Mohamed Abdelfattah is an Assistant Professor at Cornell Tech and in the Electrical and Computer Engineering Department at Cornell University. His research group is designing the next generation of machine-learning-centric computer systems for both datacenters and mobile devices. He received his BSc from the German University in Cairo, his MSc from the University of Stuttgart, and his PhD from the University of Toronto. After his PhD, Mohamed spent five years at Intel and Samsung Research.
Contact Information: https://www.mohsaied.com/ mohamed@cornell.edu
Abstract: Hardware architectures are undergoing major shifts due to the slowing of Moore's law. Innovations in packaging technology have given rise to chiplet-based architectures, which we are already seeing in CPUs and GPUs. In the near future, we will see integration of heterogeneous chiplet modules on the same compute package, allowing a mix of accelerated workloads on the same architecture without offloading to PCIe-attached accelerators. In addition, thanks to coherent fabrics like CXL, we will see tighter integration between coherent accelerators and processors. This may also allow disaggregation of memory and storage at the rack level. CXL will be an enabling technology for CPU-memory and CPU-accelerator connectivity, allowing disaggregation into fine-grained resource pools. It will allow cloud vendors to provide precisely sized compute blocks for user workloads. Silicon photonics and in-package optics will help ensure that the latency introduced by this disaggregated architecture stays within tolerable limits and, in most cases, does not come at the price of performance for latency-sensitive applications. All these hardware stack innovations will have a massive impact on future accelerators and their development. It will be important to examine these underlying trends in hardware design and what they offer so that software can leverage the performance gains accordingly. Architects will need to exploit the hardware and get performance with low overheads. The aim of this talk is to look at a CXL-enabled switch that pools memory, accelerators, and storage with compute elements to provide customers with economical, performance-oriented cloud infrastructure on demand. In addition, we aim to build a co-designed, serverless API to provision these units and provide a cost-efficient experience to users.
Bio: Gaurav Kaul is a senior systems architect at Hewlett Packard Enterprise (HPE), where he leads HPC and AI systems design for large customers in the EMEA region, including pre-exascale machines such as Archer2 (University of Edinburgh), LUMI (Finland), and Shaheen (Saudi Arabia). His work involves working with customers to understand their workloads, hardware-software co-design on upcoming generations of accelerators and CPUs from AMD, Intel, and NVIDIA, and onboarding users through best practices and knowledge transfer. In addition to his role at HPE, Gaurav is involved in various standards efforts such as OCP, CXL, UCIe, and MLIR for HPC and AI systems design. Prior to HPE, he worked at AWS, Intel, and IBM in various systems-related domains and processor design. He holds a Master's in Computer Science from the University of Manchester and lives in London, UK, with his family.
Contact information: gaurav.kaul@hpe.com
Abstract: For programming FPGA-based accelerators, high-level synthesis (HLS) is the mainstream approach. Unfortunately, HLS leaves a significant programmability gap in terms of reconfigurability, customization, and versatility: (1) FPGA physical design can take hours; (2) FPGA reconfiguration time prevents HLS from targeting complex workloads; and (3) HLS tools do not reason about cross-workload flexibility. Overlay approaches mitigate these issues by mapping programmable designs (e.g., CPUs, GPUs) on top of FPGAs. However, the abstraction gap between the overlay and the FPGA leads to low efficiency and utilization. Our work develops a new FPGA programming paradigm in which an overlay architecture is automatically specialized to a set of representative applications. The key innovation is a highly customizable overlay design space based on spatial architectures, which encompasses a range of designs from application-specific to general purpose. We leverage and extend prior work on accelerator compilers, SoC generation, and fast design space exploration (DSE) to create an end-to-end FPGA acceleration system called OverGen. OverGen can compete in performance with state-of-the-art HLS techniques while requiring 10,000x less compile and reconfiguration time.
Bio: Tony Nowatzki is an associate professor in the Computer Science Department at the University of California, Los Angeles, where he leads the PolyArch Research Group. He joined UCLA in 2017 after completing his PhD at the University of Wisconsin - Madison. He was also a consultant for Simple Machines Inc., an AI hardware startup that used several of his patents in fabricated chips. Academic recognition includes four IEEE Micro Top Picks awards, a CACM Research Highlights, best paper nominations at MICRO and HPCA, and a PLDI Distinguished Paper Award.
Contact information: https://web.cs.ucla.edu/~tjn/ tjn@cs.ucla.edu
Abstract: In this project, we set out to find innovative ways to improve inference performance on CPUs for better resource utilization. By utilizing sparse multiplication libraries and vendor-provided optimized libraries, we can notably improve ML inference performance.
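As a small illustration of the first approach, the sketch below compares a dense matrix multiplication against the same multiplication through SciPy's compressed-sparse-row (CSR) format after most weights are pruned. The shapes, sparsity level, and library choice are assumptions made for illustration, not details from the project.

```python
# Sketch of using a sparse multiplication library (SciPy CSR) for a highly
# sparse layer versus a dense matmul. Shapes and sparsity are illustrative.
import time
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
W_dense = rng.standard_normal((4096, 4096)).astype(np.float32)
W_dense[rng.random(W_dense.shape) < 0.95] = 0.0      # 95% of weights pruned
W_sparse = sparse.csr_matrix(W_dense)                 # compressed sparse rows
x = rng.standard_normal((4096, 64)).astype(np.float32)

t0 = time.perf_counter(); y_dense = W_dense @ x;   t1 = time.perf_counter()
y_sparse = W_sparse @ x;                           t2 = time.perf_counter()

print(f"dense  : {t1 - t0:.4f}s")
print(f"sparse : {t2 - t1:.4f}s")
print("max abs diff:", np.abs(y_dense - y_sparse).max())
```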
Bio: Jared Baumann is a dual-major graduate of TSU who has been working with Flapmax for a little over a year. He specializes in developing low-level solutions for improving performance.
Contact information: jared@flapmax.com
Abstract: Climate-conscious entrepreneurs, researchers, and AI practitioners (who receive monthly electricity bills) in search of a modern IT infrastructure on which to build their solutions need look no further. IBM Power server hardware, paired with a Red Hat OpenShift container software stack that stays "on" (with 99.999% availability) and a hardware-based AI accelerator (the on-chip Matrix Math Accelerator), maintains a lower energy footprint and Total Cost of Ownership (TCO) than x86 servers. In this session, learn about IBM Power servers, how you can leverage them to drive your AI roadmaps, and how to benefit from the expanding ecosystem of vendors that support IBM Power.
Abstract: In the space of hardware acceleration alternatives, FPGAs lie in the middle of the programmability-efficiency spectrum, with GPUs being more programmable and ASICs being more efficient. FPGAs provide massive parallelism and are reconfigurable, which makes them very well suited to the fast-changing needs of DL applications. But how can we minimize the gap between ASICs and FPGAs in terms of performance and efficiency while retaining their key strength, reconfigurability? This talk will dive into our research that attempts to answer this question by exploring better reconfigurable fabrics for deep learning. We will discuss how FPGAs are evolving into domain-specific reconfigurable fabrics. Specifically, we will look at new blocks called Tensor Slices and CoMeFa RAMs, which are a significant step toward closing the performance gap between FPGAs and ASICs. We will take a peek into the architecture of these blocks and discuss the performance improvement and energy reduction that can be obtained for DL applications by using modern FPGAs containing them.
Bio: Aman Arora is a PhD candidate and Graduate Fellow at The University of Texas at Austin. His research focuses on optimizing FPGA architectures to make them better deep learning accelerators. He has over 12 years of experience in the semiconductor industry in design, verification, testing, and architecture roles. He is on the academic job market for faculty positions starting next fall.
Contact information: https://amanarora.site aman.kbm@utexas.edu
Abstract: HPC developers and users in scientific domains such as climate modeling, CAD for manufacturing, CFD, molecular dynamics, histopathology, seismology, protein folding, high-energy physics, and astrophysics are exploring ways to use AI models to augment and accelerate HPC simulations, or to develop AI solutions for them. Their problems typically involve extremely large, multi-dimensional input data unlike the data used in popular DL domains such as image recognition, recommendation engines, and language and text translation. This leads to significant memory-usage and I/O-ingestion challenges in both the input data pipeline and end-to-end HPC-AI workload pipelines. The talk will feature Intel's best practices for optimizing HPC/AI workloads, including code optimizations for training and inference using Intel-optimized TensorFlow and PyTorch with Intel extensions. These optimizations result in up to 3-4X improvement over Intel's 3rd Gen Scalable Processors using AVX-512 when run on Intel's 4th Gen processors code-named Sapphire Rapids with HBM, using AMX/TMUL instructions that support mixed-precision FP32 and BFloat16 for training and quantized inference.
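As one example of the PyTorch path, the sketch below applies Intel Extension for PyTorch and bfloat16 autocast to a CPU inference run. The ResNet-50 model is a placeholder workload chosen for illustration, not one of the HPC-AI workloads from the talk.

```python
# Minimal sketch of CPU inference with Intel Extension for PyTorch and
# bfloat16 autocast, one of the optimization paths discussed in the talk.
# The ResNet-50 model here is a placeholder workload, not from the talk.
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights=None).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)   # layout/fusion/dtype opts

x = torch.randn(1, 3, 224, 224)
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(x)   # dispatches to AVX-512/AMX-optimized kernels where available
print(out.shape)
```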
Bio: Nalini Kumar works on HPC/AI workload optimization, analysis, and modeling at Intel in Santa Clara. Her primary research interests are in applying parallel, high-performance, and reconfigurable computing to traditional HPC as well as AI workloads. She is also interested in performance modeling and prediction of full applications and workflows on large-scale systems. She received her PhD and MS in Electrical and Computer Engineering from the University of Florida.
Contact information: nalini.kumar@intel.com