The transformative potential of AI in scientific research has become increasingly evident, particularly when scalable AI systems are applied on high-performance computing (HPC) platforms. Work on scalable AI for science underscores the need to integrate large-scale computational resources with vast datasets to address complex scientific challenges.
The success of AI models like ChatGPT highlights two primary advancements crucial for their effectiveness:
- The development of the transformer architecture
- The ability to train on extensive amounts of internet-scale data
These elements have set the foundation for significant scientific breakthroughs, as seen in efforts such as black hole modeling, fluid dynamics, and protein structure prediction. For instance, one study utilized AI and large-scale computing to advance models of black hole mergers, leveraging a dataset of 14 million waveforms on the Summit supercomputer.
A prime example of scalable AI’s impact is drug discovery, where transformer-based large language models (LLMs) have changed how chemical space is explored. These models are pre-trained on extensive datasets and then fine-tuned on specific tasks, learning to predict molecular structures without hand-crafted rules and thereby accelerating the discovery process. LLMs explore chemical space efficiently by tokenizing molecular representations and training with masked-token prediction. Pre-trained models for molecules and for protein sequences are then combined and fine-tuned on small labeled datasets to improve performance.
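The tokenization and mask prediction steps can be illustrated with a small sketch. It assumes molecules are written as SMILES strings, which the article does not specify, and it uses a deliberately simplified token pattern. The function names, the 15% mask rate, and the `[MASK]` token are illustrative choices, not taken from any particular model.

```python
import random
import re

# Simplified SMILES token pattern (an assumption for illustration):
# bracket atoms first, then two-letter halogens, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Replace a random subset of tokens with a mask token.

    Returns the masked sequence and a dict mapping each masked position
    to the original token, which is the label a model would be trained
    to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During pre-training, a model sees the masked sequence and is scored on how well it recovers the entries in `labels`, so no human annotation is needed.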
High-performance computing is indispensable for achieving such scientific advancements. Different scientific problems require different levels of computational scale, and HPC provides the infrastructure to handle these diverse requirements. AI for Science (AI4S) differs from consumer-centric AI in that it often deals with sparse, high-precision data from costly experiments or simulations. Scientific AI must also accommodate the specific characteristics of scientific data, including known domain knowledge such as partial differential equations (PDEs). Physics-informed neural networks (PINNs), neural ordinary differential equations (NODEs), and universal differential equations (UDEs) are methodologies developed to meet these requirements.
Scaling AI systems relies on two forms of parallelism: model-based parallelism, needed when a model exceeds the memory capacity of a single GPU, and data-based parallelism, which arises from the large volume of data required for training. For example, training a large model like GPT-3 on a single NVIDIA V100 GPU would take centuries, but parallel scaling techniques can reduce this time to just over a month on thousands of GPUs. These scaling methods are essential not only for faster training but also for enhancing model performance.
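The arithmetic behind this kind of estimate can be sketched in a few lines. The numbers below are assumptions chosen for illustration and do not come from the article: a total training cost of about 3.14e23 floating-point operations, a sustained throughput of 28 TFLOP/s per V100, a cluster of 4,096 GPUs, and a parallel efficiency of 90%.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops, flops_per_gpu, n_gpus=1, efficiency=1.0):
    """Idealized wall-clock training time in days.

    efficiency is the fraction of ideal linear speedup actually achieved
    once communication and load imbalance are accounted for.
    """
    sustained = flops_per_gpu * n_gpus * efficiency
    return total_flops / sustained / SECONDS_PER_DAY


TOTAL_FLOPS = 3.14e23  # assumed total training compute
V100_FLOPS = 28e12     # assumed sustained FLOP/s on one V100

single_gpu_years = training_days(TOTAL_FLOPS, V100_FLOPS) / 365
cluster_days = training_days(TOTAL_FLOPS, V100_FLOPS, n_gpus=4096, efficiency=0.9)

print(f"one GPU:    {single_gpu_years:.0f} years")
print(f"4,096 GPUs: {cluster_days:.0f} days")
```

Under these assumptions the single-GPU figure lands in the hundreds of years and the cluster figure at roughly a month. The efficiency term matters in practice because communication overhead grows with the number of devices.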
Scientific AI differs from consumer AI in its data handling and precision requirements. While consumer applications might rely on 8-bit integer inferences, scientific models often need high-precision floating-point numbers and strict adherence to physical laws. This is particularly true for simulation surrogate models, where integrating machine learning with traditional physics-based approaches can yield more accurate and cost-effective results. Neural networks in physics-based applications might need to impose boundary conditions or conservation laws, especially in surrogate models that replace parts of larger simulations.
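The precision gap can be made concrete with a toy experiment. This is a minimal sketch of symmetric 8-bit quantization, assuming a single scale factor for the whole list of values. The sample values and helper names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every value is zero.
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


values = [0.001234, -0.5, 0.75, 1.0, 3.14159]  # hypothetical model outputs
codes, scale = quantize_int8(values)
restored = dequantize(codes, scale)
max_abs_error = max(abs(v - r) for v, r in zip(values, restored))

print(codes)
print(max_abs_error)
```

The smallest value rounds to code 0 and is lost entirely, and every other value carries an error of up to half the scale factor. That is often tolerable for ranking or classification, but it can be unacceptable when outputs feed back into a physical simulation.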
One critical aspect of AI4S is accommodating the specific characteristics of scientific data. This includes handling physical constraints and incorporating known domain knowledge, such as PDEs. Soft penalty constraints, neural operators, and symbolic regression are among the methods used in scientific machine learning. For instance, PINNs add the norm of the PDE residual to the loss function, so the optimizer minimizes both the data loss and the PDE residual, yielding an approximation that satisfies the physics.
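The structure of such a loss can be shown on a toy problem. This sketch assumes a one-parameter surrogate u(x; k) = exp(-k x) and the ordinary differential equation du/dx + u = 0 in place of a real PDE. It uses finite differences and a coarse grid search where an actual PINN would use automatic differentiation and a gradient-based optimizer. All names and values are illustrative.

```python
import math


def model(x, k):
    """Hypothetical one-parameter surrogate u(x; k) = exp(-k * x)."""
    return math.exp(-k * x)


def residual(x, k, h=1e-5):
    """Residual of the toy equation du/dx + u = 0, using a central difference."""
    du_dx = (model(x + h, k) - model(x - h, k)) / (2 * h)
    return du_dx + model(x, k)


def pinn_loss(k, data, collocation, weight=1.0):
    """Mean-squared data misfit plus weighted mean-squared equation residual."""
    data_loss = sum((model(x, k) - u) ** 2 for x, u in data) / len(data)
    physics_loss = sum(residual(x, k) ** 2 for x in collocation) / len(collocation)
    return data_loss + weight * physics_loss


data = [(0.0, 1.0), (1.0, math.exp(-1.0))]  # two sparse "measurements"
collocation = [i / 10 for i in range(11)]   # points where the physics is enforced

candidates = [i / 100 for i in range(50, 151)]  # coarse search over k in [0.5, 1.5]
best_k = min(candidates, key=lambda k: pinn_loss(k, data, collocation))
print(best_k)
```

The physics term is evaluated at collocation points that need no measurements, which is why this style of loss suits the sparse, expensive data typical of scientific problems.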
Parallel scaling techniques are diverse, including data-parallel and model-parallel approaches. Data-parallel training involves dividing a large batch of data across multiple GPUs, each processing a portion of the data simultaneously. On the other hand, model-parallel training distributes different parts of the model across various devices, which is particularly useful when the model size exceeds the memory capacity of a single GPU. Spatial decomposition can be applied in many scientific contexts where data samples are too large to fit on a single device.
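Data-parallel training can be simulated on a single machine to show why it gives the same update as one large batch. This is a minimal sketch, assuming a one-weight linear model and equally sized shards. The `all_reduce_mean` helper is a stand-in for the collective communication step that a real framework would perform across devices.

```python
def gradient(w, batch):
    """Gradient of the mean squared error for the linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch, n_workers):
    """Split a batch into n_workers equal, contiguous shards.

    Assumes len(batch) is divisible by n_workers.
    """
    size = len(batch) // n_workers
    return [batch[i * size:(i + 1) * size] for i in range(n_workers)]


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)


batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples from y = 3x
w = 0.0

# Each "worker" computes a gradient on its own shard of the batch.
local_grads = [gradient(w, s) for s in shard(batch, n_workers=4)]
synced_grad = all_reduce_mean(local_grads)

# Reference: the gradient computed on the whole batch at once.
full_grad = gradient(w, batch)
print(synced_grad, full_grad)
```

Because the shards are the same size, averaging the local gradients reproduces the full-batch gradient, so every worker applies an identical update and the model replicas stay in sync.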
The evolution of AI for science includes the development of hybrid AI-simulation workflows, such as cognitive simulations (CogSim) and digital twins. These workflows blend traditional simulations with AI models to enhance prediction accuracy and decision-making processes. For instance, in neutron scattering experiments, AI-driven methods can reduce the time required for experimental decision-making by providing real-time analysis and steering capabilities.
Several trends are shaping the landscape of scalable AI for science. The shift towards mixture-of-experts (MoE) models, which are sparsely connected and thus more cost-effective than monolithic models, is gaining traction. Because only a subset of experts is active for any given input, these models can carry a large total parameter count at modest compute cost, making them suitable for complex scientific tasks. The concept of an autonomous laboratory driven by AI is another exciting development. With integrated research infrastructures (IRIs) and foundation models, these labs can conduct real-time experiments and analyses, expediting scientific discovery.
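The sparsity of an MoE layer comes from its routing step, sketched here for scalar inputs. The linear gate, the four toy experts, and the top-2 selection are assumptions made for illustration. Production MoE layers use learned gates over vectors and add load-balancing terms that are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs."""
    scores = [w * x for w in gate_weights]  # toy linear gate
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([scores[i] for i in chosen])  # renormalize over chosen experts
    # Only the chosen experts are evaluated; the rest cost nothing.
    output = sum(wt * experts[i](x) for wt, i in zip(weights, chosen))
    return output, chosen


experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_weights = [0.1, 0.9, 0.5, -0.3]  # hypothetical gate parameters

output, chosen = moe_forward(3.0, gate_weights, experts)
print(output, chosen)
```

With four experts and `top_k=2`, half of the expert parameters are skipped for this input, which is the source of the cost advantage over a dense model of the same total size.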
The limitations of transformer-based models, such as restricted context length and high computational expense, have renewed interest in linear recurrent neural networks (RNNs), which are more efficient on long sequences. Additionally, operator-based models for solving PDEs are becoming more prominent, allowing AI to simulate entire classes of problems rather than individual instances.
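The efficiency argument for linear recurrences can be seen in a scalar sketch. The update rule h_t = a * h_(t-1) + b * x_t and the coefficients below are assumptions for illustration, and real linear RNNs use vector states with learned parameters.

```python
def linear_rnn(inputs, a=0.9, b=0.1):
    """Run the linear recurrence h_t = a * h_(t-1) + b * x_t over a sequence."""
    h = 0.0
    states = []
    for x in inputs:
        h = a * h + b * x  # one constant-cost update per token
        states.append(h)
    return states


def closed_form_state(inputs, t, a=0.9, b=0.1):
    """The same state at step t, written as a weighted sum over past inputs."""
    return sum(b * a ** (t - s) * inputs[s] for s in range(t + 1))


sequence = [1.0, 1.0, 1.0, 0.0, 2.0]
states = linear_rnn(sequence)
print(states)
print(closed_form_state(sequence, len(sequence) - 1))
```

Each step touches a fixed-size state once, so the cost grows linearly with sequence length, whereas self-attention compares every pair of tokens. The closed form holds because the recurrence has no nonlinearity between steps, which is also what allows such models to be evaluated in parallel over the sequence.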
Finally, interpretability and explainability in AI models must be considered. As scientists remain cautious of AI/ML methods, developing tools to elucidate the rationale behind AI predictions is crucial. Techniques like Class Activation Mapping (CAM) and attention map visualization help provide insights into how AI models make decisions, fostering trust and broader adoption in the scientific community.