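Materials science has benefited greatly from advances in machine learning (ML) and deep learning (DL). These techniques have transformed the prediction of molecular properties and prompted changes to traditional computational methods. As indispensable tools for data-driven materials science, ML/DL techniques are steadily improving in both the accuracy and the speed of property prediction.
Fig. 1 Overview of extrapolative prediction of molecular property based on the range of molecular properties and the diversity of molecular structures.
However, ML/DL techniques still face a fundamental contradiction concerning their inherent difficulty with extrapolation, that is, the ability to predict beyond the available data. The primary goal of data-driven materials exploration is to identify high-performance molecules/materials that do not yet appear in databases. ML/DL models must therefore be able to infer unknown data from the existing data alone.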
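Fig. 2 Model description used for the benchmark.
However, materials datasets usually consist of small sets of experimental results and therefore inevitably carry biases. It is crucial to determine whether ML/DL models can overcome these biases and extrapolate molecular properties effectively.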
Fig. 3 Evaluation methods for assessing interpolation and extrapolative performance.
Fig. 4 Evaluation results of the interpolation test using all data points of each dataset and extrapolation tests of property range and molecular structure (cluster) at data size for interpolation Nin = 200 (50 for EBD) with RMSE relative to σall, where σall represents the standard deviation of each dataset as listed in Table 1.
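The metric named in the Fig. 4 caption, RMSE relative to σall, together with a property-range extrapolation test, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' protocol: the helper names `property_range_split` and `relative_rmse` are hypothetical, the hold-out rule (the highest-valued 20% of samples form the test set) is a guess at one reasonable property-range split, σall is taken as the population standard deviation, and the toy property values are invented.

```python
import math
import statistics


def property_range_split(values, holdout_fraction=0.2):
    """Hypothetical split: hold out the highest-valued samples as the test set."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_test = max(1, int(round(len(values) * holdout_fraction)))
    return order[:-n_test], order[-n_test:]


def relative_rmse(y_true, y_pred, sigma_all):
    """RMSE divided by the standard deviation of the whole dataset."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / sigma_all


# Invented toy property values (not taken from any dataset in the paper).
y = [2.0, 0.5, 4.5, 1.0, 3.5, 5.0, 1.5, 3.0, 2.5, 4.0]

train_idx, test_idx = property_range_split(y, holdout_fraction=0.2)
sigma_all = statistics.pstdev(y)  # assumption: population standard deviation

# Baseline that cannot leave the training range: always predict the training mean.
baseline = statistics.mean([y[i] for i in train_idx])
y_test = [y[i] for i in test_idx]
score = relative_rmse(y_test, [baseline] * len(y_test), sigma_all)

print(f"held-out indices: {sorted(test_idx)}, relative RMSE: {score:.2f}")
```

A relative RMSE above 1 means the model's error on the held-out range exceeds the spread of the whole dataset, which is the kind of degradation an extrapolation test is meant to expose.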
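To address this challenge, the authors introduced a quantum-mechanical descriptor dataset called QMex, together with an interactive linear regression that includes interaction terms between quantum-mechanical descriptors and categorical information on molecular structures. The QMex-based interactive linear regression achieved state-of-the-art extrapolative performance while preserving its interpretability.
Fig. 5 Ratio of models ranking within the top three for each data size Nin.
Their benchmark results, the QMex dataset, and the proposed model are highly valuable for improving extrapolative prediction on small experimental datasets and for discovering new materials/molecules that surpass existing candidates. The article was recently published in npj Computational Materials 10: 11 (2024).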
Fig. 6 Model performance comparison for extrapolation tests.
Editorial Summary
Extrapolative prediction of small-data molecular property: Quantum mechanics-assisted machine learning
Materials science has greatly benefited from advancements in machine learning (ML) and deep learning (DL) techniques. These techniques have revolutionized the prediction of molecular properties, leveraging traditional computational approaches. ML/DL techniques continue to enhance the accuracy and speed of property prediction, serving as indispensable tools for data-driven materials science.
Fig. 7 Summary of ML/DL model selection for interpolation and extrapolation of molecular property prediction.
However, a fundamental contradiction persists in ML/DL techniques regarding their inherent extrapolation difficulty, i.e., the ability to predict beyond the available data. The primary objective of data-driven materials exploration is to identify high-performance molecules/materials that are not yet represented in databases. Hence, ML/DL models must possess the capability to extrapolate to unexplored data solely from the available data. Yet materials datasets often consist of small sets of experimental results, which inevitably carry biases. It is crucial to determine whether ML/DL models can overcome these biases and effectively extrapolate molecular properties.
Fig. 8 Model performance comparison between QMex-LR and QMex-ILR.
Hajime Shimakawa et al. from the Department of Electrical Engineering & Information Systems, School of Engineering, University of Tokyo, presented a comprehensive benchmark for assessing extrapolative performance across 12 organic molecular properties. Their large-scale benchmark revealed that conventional ML models exhibit remarkable performance degradation beyond the training distribution of property range and molecular structures, particularly for small-data properties. To address this challenge, they introduced a quantum-mechanical (QM) descriptor dataset, called QMex, and an interactive linear regression (ILR), which incorporates interaction terms between QM descriptors and categorical information pertaining to molecular structures. The QMex-based ILR achieved state-of-the-art extrapolative performance while preserving its interpretability. Their benchmark results, QMex dataset, and proposed model serve as valuable assets for improving extrapolative predictions with small experimental datasets and for the discovery of novel materials/molecules that surpass existing candidates.
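The idea of adding interaction terms between descriptors and categorical structure information to a linear regression can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' QMex-ILR implementation: the function names (`ilr_design_matrix`, `fit_least_squares`, `true_property`), the single synthetic descriptor `x`, the binary structural flag `c`, and the ground-truth coefficients are all hypothetical, and the fit is ordinary least squares via NumPy.

```python
import numpy as np


def ilr_design_matrix(X, C):
    """Columns: intercept, descriptors X, categories C, and every product x_j * c_k."""
    n = X.shape[0]
    interactions = (X[:, :, None] * C[:, None, :]).reshape(n, -1)
    return np.hstack([np.ones((n, 1)), X, C, interactions])


def plain_design_matrix(X, C):
    """Same main effects, but without the interaction columns."""
    return np.hstack([np.ones((X.shape[0], 1)), X, C])


def fit_least_squares(A, y):
    # Minimum-norm least-squares solution; tolerates redundant columns.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def true_property(x, c):
    # Hypothetical ground truth: the slope in x depends on the structural flag c.
    return 1.0 + 2.0 * x + 0.5 * c + 3.0 * x * c


# Train on descriptor values in [0, 1]; test on [1, 2], outside the training range.
x_train, x_test = np.linspace(0.0, 1.0, 20), np.linspace(1.0, 2.0, 20)
c_train = c_test = (np.arange(20) % 2).astype(float)

X_train, C_train = x_train[:, None], c_train[:, None]
X_test, C_test = x_test[:, None], c_test[:, None]
y_train, y_test = true_property(x_train, c_train), true_property(x_test, c_test)

w_ilr = fit_least_squares(ilr_design_matrix(X_train, C_train), y_train)
w_lr = fit_least_squares(plain_design_matrix(X_train, C_train), y_train)

ilr_rmse = rmse(y_test, ilr_design_matrix(X_test, C_test) @ w_ilr)
lr_rmse = rmse(y_test, plain_design_matrix(X_test, C_test) @ w_lr)
print(f"extrapolation RMSE  plain LR: {lr_rmse:.3f}  with interactions: {ilr_rmse:.2e}")
```

In this noise-free toy setting the model with interaction terms recovers the category-dependent slope and so stays accurate outside the training range, whereas the plain linear model is forced to share one slope across both categories. Each fitted coefficient still maps to a single named term, which is the sense in which such a model remains interpretable.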
This article was recently published in npj Computational Materials 10: 11 (2024).
Original abstract
Extrapolative prediction of small-data molecular property using quantum mechanics-assisted machine learning
Hajime Shimakawa, Akiko Kumada & Masahiro Sato
Abstract
Data-driven materials science has realized a new paradigm by integrating materials domain knowledge and machine-learning (ML) techniques. However, ML-based research has often overlooked the inherent limitation in predicting unknown data: extrapolative performance, especially when dealing with small-scale experimental datasets. Here, we present a comprehensive benchmark for assessing extrapolative performance across 12 organic molecular properties. Our large-scale benchmark reveals that conventional ML models exhibit remarkable performance degradation beyond the training distribution of property range and molecular structures, particularly for small-data properties. To address this challenge, we introduce a quantum-mechanical (QM) descriptor dataset, called QMex, and an interactive linear regression (ILR), which incorporates interaction terms between QM descriptors and categorical information pertaining to molecular structures. The QMex-based ILR achieved state-of-the-art extrapolative performance while preserving its interpretability. Our benchmark results, QMex dataset, and proposed model serve as valuable assets for improving extrapolative predictions with small experimental datasets and for the discovery of novel materials/molecules that surpass existing candidates.