Research paper accepted by IEEE Transactions on Reliability

Deep learning shows great potential for bearing fault diagnosis, but its effectiveness is severely limited by the highly imbalanced data prevalent in real-world industrial settings, where fault events are extremely rare. This paper proposes a novel method for imbalanced bearing fault diagnosis that combines class-aware supervised contrastive learning with a quadratic network backbone. The integrated approach, named CCQNet, counters the effects of highly skewed data distributions by improving feature representation and classification fairness. Comprehensive experiments show that CCQNet substantially outperforms existing methods on imbalanced data, particularly at high imbalance ratios such as 50:1. This study provides an effective and innovative solution for imbalanced bearing fault diagnosis. The source code is available at https://github.com/yuweien1120/CCQNet for public evaluation.
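The quadratic network backbone mentioned above replaces the inner product of a conventional neuron with a quadratic function of the input. A minimal sketch of one common quadratic-neuron formulation, with hypothetical parameter names and the activation function omitted (the paper's exact formulation may differ):

```python
def quadratic_neuron(x, wr, br, wg, bg, wb, c):
    """Quadratic neuron: y = (wr.x + br) * (wg.x + bg) + wb.(x*x) + c.

    Unlike a conventional neuron (a single inner product), two linear
    terms are multiplied and a power term is added, so each neuron has
    a quadratic rather than linear decision surface.
    """
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    return (dot(wr, x) + br) * (dot(wg, x) + bg) \
        + dot(wb, [xi * xi for xi in x]) + c
```

With `wr` selecting the first input and `wg` the second, the neuron reduces to their product, illustrating the multiplicative interaction a linear neuron cannot express.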

Research paper accepted by Reliability Engineering & System Safety

Although machine learning (ML) and deep learning (DL) methods are increasingly used for anomaly detection in industrial cyber-physical systems (ICPSs), their adoption is hindered by concerns about model trustworthiness, especially high false alarm rates (FARs). Excessive false alarms overwhelm operators, cause unnecessary shutdowns, and reduce operational efficiency. This study addresses these challenges by proposing a novel framework that integrates ML-based anomaly detectors with conformal prediction (CP), a model-agnostic uncertainty quantification technique. To handle distribution shifts in time-series data, our framework incorporates a temporal quantile adjustment method with a sliding calibration set, ensuring statistical guarantees on predefined FARs. A rejection mechanism is further integrated by excluding significant anomalies from the calibration set, improving detection capability while maintaining FAR guarantees. For real-time anomaly monitoring, two P-value-based indicators generated from CP are developed to track anomalous trends and enhance model interpretability. The framework is evaluated by comparing several baseline ML and DL methods to their conformalized counterparts using a public ICPS dataset. Comparative results based on Precision, Recall, F1, and AUROC validate the framework’s compatibility with various ML models and its effectiveness in improving anomaly detection performance by reducing false alarms and guaranteeing FARs across a range of predefined values.
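The FAR guarantee in conformal anomaly detection rests on a simple p-value construction over a calibration set of nominal-behavior scores. A minimal sketch under standard split-conformal assumptions (the paper's sliding calibration set and temporal quantile adjustment are not shown):

```python
def conformal_p_value(score, calibration_scores):
    """Conformal p-value: the (adjusted) fraction of calibration scores
    at least as extreme as the test score. Raising an alarm whenever
    p <= alpha bounds the false alarm rate by alpha, provided the
    calibration and test data are exchangeable."""
    n_extreme = sum(1 for s in calibration_scores if s >= score)
    return (1 + n_extreme) / (1 + len(calibration_scores))
```

Usage would look like `alarm = conformal_p_value(s, calib) <= alpha`; tracking the p-values over time gives the kind of trend indicator the abstract describes.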

Research paper accepted by IEEE Transactions on Systems, Man, and Cybernetics: Systems

Neural networks that overlook the underlying causal relationships among observed variables pose significant risks in high-stakes decision-making contexts due to concerns about the robustness and stability of model performance. To tackle this issue, we present a general approach for embedding the hierarchical causal structure among observed variables into a neural network to inform its learning. The proposed methodology, termed causality-informed neural network (CINN), exploits hierarchical causal structure learned from observational data as a structurally informed prior to guide the layer-to-layer architectural design of the neural network, while maintaining the orientation of causal relationships in the discovered causal graph. The proposed method involves three steps. First, CINN mines causal relationships from observational data via directed acyclic graph (DAG) learning, where causal discovery is recast as a continuous optimization problem to circumvent the combinatorial nature of DAG learning. Second, we encode the discovered hierarchical causal graph among observed variables into the neural network via a dedicated architecture and loss function. By classifying observed variables in the DAG as root, intermediate, and leaf nodes, we translate the hierarchical causal DAG into CINN by creating a one-to-one correspondence between DAG nodes and certain CINN neurons. For the loss function, both intermediate and leaf nodes in the DAG are treated as target outputs during CINN training, facilitating the co-learning of causal relationships among the observed variables. Finally, as multiple loss components emerge in CINN, we leverage the projection of conflicting gradients to mitigate gradient interference among the multiple learning tasks. Computational studies indicate that CINN outperforms several state-of-the-art methods across a broad range of datasets. In addition, an ablation study that incrementally incorporates structural and quantitative causal knowledge into the neural network highlights the pivotal role of causal knowledge in enhancing the neural network’s prediction performance.
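The gradient-projection step described above can be sketched as follows: when two task gradients conflict (negative inner product), one is projected onto the normal plane of the other so the tasks stop pulling the shared weights in opposing directions. A minimal sketch with plain-list gradients (function name hypothetical):

```python
def project_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), remove from
    g_i its component along g_j; otherwise return g_i unchanged."""
    dot = sum(a * b for a, b in zip(g_i, g_j))
    if dot >= 0:
        return list(g_i)  # no conflict: leave the gradient as is
    norm_sq = sum(b * b for b in g_j)
    # subtract the projection of g_i onto g_j
    return [a - (dot / norm_sq) * b for a, b in zip(g_i, g_j)]
```

After projection, the adjusted gradient is orthogonal to the gradient it conflicted with, so applying it no longer increases the other task's loss to first order.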

Research paper accepted by IEEE Transactions on Emerging Topics in Computational Intelligence

Equipping deep learning models with principled uncertainty quantification (UQ) has become essential for ensuring their reliable performance in the open world. To handle uncertainty arising from two prevalent sources in open-world settings, distribution shift and out-of-distribution (OOD) data, this paper presents a unified uncertainty-informed approach for quantifying and managing the risks these factors pose to the dependable functioning of deep learning models. Toward this goal, we propose leveraging a principled UQ approach, the Spectral-normalized Neural Gaussian Process (SNGP), to quantify the epistemic uncertainty associated with model predictions. Unlike other UQ methods in the literature, SNGP is characterized by two unique properties: (1) it applies spectral normalization to the weights of the neural network’s hidden layers to preserve the relative distances among data points during data transformations; and (2) it replaces the traditional output layer of the neural network with a Gaussian process to enable distance-aware uncertainty estimation. Based on SNGP’s uncertainty estimate, we apply Youden’s index to determine an optimal threshold for categorizing the uncertainty into distinct levels, thereby enabling decision-makers to make uncertainty-informed decisions. Two datasets of varying scale are used to demonstrate how the proposed method facilitates risk assessment and management of deep learning models in open environments. Computational results reveal that the proposed method achieves prediction performance comparable to Monte Carlo dropout and deep ensembles. Importantly, it outperforms these two methods by providing computationally efficient, consistent, and principled uncertainty estimates under no distribution shift, distribution shift, and OOD conditions.
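Youden's index selects the cut-off that maximizes J = TPR - FPR over candidate thresholds on the uncertainty score. An illustrative sketch with binary labels, flagging a sample when its score meets the threshold (data shapes hypothetical):

```python
def youden_threshold(scores, labels):
    """Return the cut-off t (flag when score >= t) that maximizes
    Youden's J = TPR - FPR, together with the best J value."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tpr = sum(s >= t and y == 1 for s, y in zip(scores, labels)) / n_pos
        fpr = sum(s >= t and y == 0 for s, y in zip(scores, labels)) / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j
```

The resulting threshold splits predictions into low- and high-uncertainty levels; in practice one would compute TPR/FPR from a full ROC sweep rather than this brute-force loop.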

Research paper accepted by Reliability Engineering & System Safety

Multi-state systems (MSSs) are widely used for modeling the behavior of engineering applications in which the system and its components can occupy more than two distinct states. Physics-informed neural networks (PINNs) offer a viable solution for characterizing the dynamic state evolution of MSSs. However, existing methods predominantly rely on uniformly sampled collocation points across the problem domain when training PINNs. Although some residual-based active learning methods exist, they are inherently static and local, and often fail to capture a crucial aspect of PINN training: identifying and accurately modeling the “critical transition regions” within the problem domain. To address this fundamental challenge, we treat the PINN as a dynamical system and introduce a novel active learning method grounded in chaos theory to identify regions of the problem domain that are highly sensitive to initial conditions. Specifically, our method quantifies the degree of chaos at candidate collocation points by introducing small perturbations and using the PINN’s forward propagation to simulate the dynamic evolution of both the original and perturbed collocation points. Collocation points that exhibit pronounced chaotic behavior, in that their evolutionary trajectories diverge rapidly following perturbation, are identified as the system’s most unstable regions and the most valuable for PINN training. By prioritizing these dynamically unstable points, our method directs the PINN to focus its learning on accurately delineating the boundaries of state transitions, thereby significantly enhancing the accuracy of reliability analysis. Experimental results on multiple benchmark partial differential equations (PDEs) and several MSSs demonstrate that, compared with other PINN learning schemes, our method achieves superior accuracy and computational efficiency in MSS reliability assessment.
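The chaos-based selection can be illustrated with a finite-difference sensitivity proxy: perturb a candidate collocation point, propagate both points through the network, and rank candidates by how fast the outputs diverge. A deliberately simplified sketch (the paper simulates full evolutionary trajectories rather than a single forward pass; `model`, `eps`, and the function names are illustrative assumptions):

```python
def divergence_score(model, x, eps=1e-3):
    """Rate at which the model output diverges under a small input
    perturbation; larger values mark more 'chaotic' collocation points."""
    y_base = model(x)
    y_pert = model([xi + eps for xi in x])
    return abs(y_pert - y_base) / eps

def select_collocation_points(model, candidates, k):
    """Keep the k candidate points with the largest divergence scores."""
    return sorted(candidates,
                  key=lambda x: divergence_score(model, x),
                  reverse=True)[:k]
```

Points near sharp state transitions yield large scores and are sampled preferentially, concentrating the training budget where the solution changes fastest.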

Research paper accepted by INFORMS Journal on Computing

Accurate and reliable prediction has profound implications for a wide range of applications, such as hospital admissions, inventory control, and route planning. In this study, we focus on one instance of the spatio-temporal learning problem, traffic prediction, to demonstrate an advanced deep learning model developed for making accurate and reliable predictions. Despite significant progress in traffic prediction, few studies have incorporated both explicit (e.g., road network topology) and implicit (e.g., causality-related traffic phenomena and the impact of exogenous factors) traffic patterns simultaneously to improve prediction performance. Meanwhile, the variable nature of traffic states necessitates quantifying the uncertainty of model predictions in a statistically principled way; however, extant studies offer no provable guarantee on the statistical validity of confidence intervals in reflecting their actual likelihood of containing the ground truth. In this paper, we propose an end-to-end traffic prediction framework that leverages three primary components to generate accurate and reliable traffic predictions: dynamic causal structure learning for discovering implicit traffic patterns from massive traffic data, a causally-aware spatio-temporal multi-graph convolution network (CASTMGCN) for learning spatio-temporal dependencies, and conformal prediction for uncertainty quantification. In particular, CASTMGCN fuses several graphs that characterize different important aspects of traffic networks (including the physical road structure, time-lagged causal effects, and contemporaneous causal relationships) with an auxiliary graph that captures the effect of exogenous factors on the road network. On this basis, a conformal prediction approach tailored to spatio-temporal data is further developed for quantifying the uncertainty in node-wise traffic predictions over varying prediction horizons. Experimental results on two real-world traffic datasets of varying scale demonstrate that the proposed method outperforms several state-of-the-art models in prediction accuracy; moreover, it generates more efficient prediction regions than several other methods while strictly satisfying the statistical validity of coverage.
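The coverage guarantee comes from the standard split-conformal construction: a held-out calibration set yields a residual quantile that widens each point prediction into an interval. A minimal sketch for a single scalar prediction (the paper's spatio-temporal tailoring and node-wise treatment are not shown):

```python
import math

def split_conformal_interval(point_pred, calib_abs_residuals, alpha=0.1):
    """Interval point_pred +/- q, where q is the ceil((n+1)(1-alpha))-th
    smallest absolute calibration residual. Under exchangeability the
    interval covers the ground truth with probability >= 1 - alpha."""
    n = len(calib_abs_residuals)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    q = sorted(calib_abs_residuals)[rank - 1]
    return point_pred - q, point_pred + q
```

"More efficient prediction regions" in the abstract means a smaller half-width q at the same nominal coverage level.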

Research paper accepted by Transportation Research Part E

Understanding causal relationships between traffic states throughout the system is of great significance for enhancing traffic management and optimization in urban traffic networks. Unfortunately, few studies in the literature have systematically analyzed the causal structure characterizing the evolution of traffic states over time or gauged the importance of traffic nodes from a causal perspective, particularly in the context of large-scale traffic networks. Moreover, the dynamic nature of traffic patterns necessitates a robust method for reliably discovering causal relationships, a requirement often overlooked in existing studies. To address these issues, we propose a Spatio-Temporal Causal Structure Learning and Analysis (STCSLA) framework for analyzing large-scale urban traffic networks at a mesoscopic level through a causal lens. The proposed framework comprises three main components: decomposition of spatio-temporal traffic data into localized traffic subprocesses; Bayesian information criterion-guided spatio-temporal causal structure learning, combined with temporal-dependency-preserving sampling, for deriving a reliable causal graph that uncovers time-lagged and contemporaneous causal effects; and several causality-oriented indicators for identifying causally critical nodes, mediator nodes, and bottleneck nodes in traffic networks. Experimental results on both a synthetic dataset and a real-world Hong Kong traffic dataset demonstrate that the proposed STCSLA framework accurately uncovers time-varying causal relationships and identifies key nodes that play various causal roles in influencing traffic dynamics. These findings underscore the potential of the proposed framework to improve traffic management and provide a comprehensive causality-driven approach for analyzing urban traffic networks.

Review paper on AI system reliability accepted by Journal of Reliability Science and Engineering

As the potential applications of AI continue to expand, a central question remains unresolved: will users trust and adopt AI-powered technologies? Since AI’s promise hinges closely on perceptions of its trustworthiness, guaranteeing the reliability and trustworthiness of AI plays a fundamental role in fostering its broad adoption in practice. However, the theories, mathematical models, and methods in reliability engineering and risk management have not kept pace with the rapid technological progress in AI. As a result, the lack of essential components (e.g., reliability, trustworthiness) in the resulting models has emerged as a major roadblock to regulatory approval and widespread adoption of AI-powered solutions in high-stakes decision environments, such as healthcare, aviation, finance, and nuclear power plants. To fully harness AI’s power for automating decision making in these safety-critical applications, it is essential to manage expectations for what AI can realistically deliver and to build appropriate levels of trust. In this paper, we focus on the functional reliability of AI systems developed through supervised learning and discuss the unique characteristics of AI systems that necessitate specialized reliability engineering and risk management theories and methods for creating functionally reliable AI systems. Next, we thoroughly review five prevalent engineering mechanisms in the existing literature for approaching functionally reliable and trustworthy AI: uncertainty quantification (UQ), comprising model-based UQ and model-agnostic conformal prediction; failure prediction; learning with abstention; formal verification; and knowledge-enabled AI. Furthermore, we outline several research challenges and opportunities related to the development of reliability engineering and trustworthiness assurance methods for AI systems. Our research aims to deepen the understanding of reliability and trustworthiness issues associated with AI systems, and to inspire researchers in risk and reliability engineering and beyond to contribute to this area of emerging importance.

Research paper accepted by European Journal of Operational Research

It is common for multiple firms, such as manufacturers, retailers, and third-party insurers, to coexist and compete in the aftermarket for durable products. In this paper, we study price competition in a partially concentrated aftermarket where one firm offers multiple extended warranty (EW) contracts while the others each offer a single one. The demand for EWs is described by the multinomial logit model. We show that, at equilibrium, such an aftermarket behaves like a combination of monopoly and oligopoly. Building upon this base model, we further investigate sequential pricing games for a durable product and its EWs to accommodate the ancillary nature of after-sales services. We consider two scenarios: one where the manufacturer (as the market leader) sets product and EW prices simultaneously, and another where these decisions are made sequentially. Our analysis demonstrates that offering EWs incentivizes the manufacturer to lower the product price, thereby expanding the market potential for EWs. Simultaneous product-EW pricing leads to a price concession on EWs compared with sequential pricing, effectively reducing the intensity of competition in the aftermarket. Overall, the competitiveness of an EW hinges on its ability to deliver high value to consumers at a low marginal cost to its provider. While our focus is on EWs, the proposed game-theoretical pricing models apply broadly to other ancillary after-sales services.
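Under the multinomial logit demand used in the paper, each EW contract's market share follows the standard logit formula with an outside (no-purchase) option. A minimal sketch with a linear utility specification; the utility form and parameter names here are illustrative assumptions, not the paper's exact model:

```python
import math

def mnl_shares(utilities):
    """MNL choice probabilities over the offered EW contracts, with an
    outside (no-purchase) option whose utility is normalized to zero."""
    weights = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(weights)  # the 1.0 is the outside option
    return [w / denom for w in weights]

def ew_share(value, price, b=1.0):
    """Share of a single EW with illustrative utility u = value - b * price."""
    return mnl_shares([value - b * price])[0]
```

Because shares depend on utilities only through differences, an EW gains share either by delivering more value or by undercutting on price, which is the value-versus-marginal-cost trade-off the abstract highlights.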

Research paper accepted by IEEE Transactions on Automation Science and Engineering

The demand for disruption-free fault diagnosis of mechanical equipment in a constantly changing operating environment poses a great challenge to deploying data-driven diagnosis models in practice. Existing continual learning-based diagnosis models require a large number of labeled samples to adapt to new diagnostic tasks and fail to account for the diagnosis of heterogeneous fault types across different machines. In this paper, we use a representative class of mechanical equipment, rotating machinery, as an example and develop an uncertainty-aware continual learning framework (UACLF) to provide a unified interface for fault diagnosis of rotating machinery under various dynamic scenarios: the class continual scenario, the domain continual scenario, and both. The proposed UACLF takes three steps to tackle fault diagnosis of rotating machinery with homogeneous and heterogeneous faults in dynamic environments. First, an inter-class classification loss function and an intra-class discrimination loss function are devised to extract informative feature representations from the raw vibration signal for fault classification. Second, an uncertainty-aware pseudo-labeling mechanism is developed to select those unlabeled fault samples to which pseudo labels can be assigned confidently, thus expanding the training samples for faults arising in the new environment. Third, an adaptive prototypical feedback mechanism is used to sharpen the decision boundary of fault classification and reduce the model’s misclassification rate. Experimental results on three datasets suggest that the proposed UACLF outperforms several alternatives in the literature on fault diagnosis of rotating machinery across various working conditions and different machines.
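The uncertainty-aware pseudo-labeling step can be illustrated with a simple entropy gate: a pseudo label is assigned only when the predictive distribution is sufficiently peaked. A minimal sketch, where the threshold `tau` and the abstain-as-None convention are illustrative assumptions rather than the paper's exact mechanism:

```python
import math

def confident_pseudo_labels(prob_rows, tau=0.5):
    """For each predictive distribution, assign the argmax class as a
    pseudo label if its entropy falls below tau; otherwise abstain (None)."""
    labels = []
    for probs in prob_rows:
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        if entropy < tau:
            labels.append(max(range(len(probs)), key=probs.__getitem__))
        else:
            labels.append(None)  # too uncertain: keep the sample unlabeled
    return labels
```

Confidently pseudo-labeled samples then augment the training set for faults arising in the new environment, while uncertain ones are held back rather than risking label noise.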