Is Trustworthy and Explainable AI Within Reach?
Evaluating large language model (LLM) outputs remains a central challenge for their safe and effective deployment in domains that require transparency, reliability, and interpretability. Reference-based evaluation methods require a ground-truth answer, which limits their utility in real-world contexts. Reference-free systems often reduce multiple aspects of response quality to a single metric, offering limited insight into the nature, severity, and uncertainty of the errors identified in LLM responses. To address this gap, we propose TEBScore (Multi-axis Diagnostic Framework for Trustworthy and Explainable LLM Evaluation): a configurable, context-aware, model-agnostic framework for reference-free evaluation of LLM responses along the dimensions of Trustworthiness, Explainability, and Bias. In this work, we adopt a design-oriented perspective in evaluating TEBScore and explicitly articulate the architectural and design choices underlying the framework, including dimensional decomposition, span-level error localization, ensemble-based severity estimation, additive aggregation, and uncertainty quantification. We empirically analyze how these design decisions influence diagnostic specificity, sensitivity to error magnitude, confidence interval calibration, and computational cost. We further validate TEBScore's automated human-alignment calibration module on subsets of four benchmark datasets and six LLMs. In doing so, we demonstrate TEBScore's compatibility with diverse natural language tasks and judge models, and the ability of its automated calibration module to identify task-specific configurations that better reflect human quality judgments, with alignment gains observed in 17 of 24 model × task combinations.
TEBScore supports tunable weighting, bring-your-own-model extensibility, and an interactive graphical interface for span-level auditing and oversight. We propose it as a proof of concept for transparent, uncertainty-aware, human-in-the-loop evaluation of LLM responses in real-world applications.
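To make the interplay of span-level errors, ensemble-based severity estimation, additive aggregation, and uncertainty quantification concrete, the following Python sketch shows one way such a pipeline could be wired together. It is an illustration under stated assumptions, not TEBScore's actual implementation: the names `SpanError`, `score`, and `score_with_interval`, the dimension labels, the weights, the clipping to [0, 1], and the use of a percentile bootstrap over judge severities are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class SpanError:
    """One localized error (hypothetical structure, not from the paper)."""
    dimension: str     # assumed labels: "trust", "explain", "bias"
    start: int         # character offset where the flagged span begins
    end: int           # character offset where the flagged span ends
    severities: list   # one severity in [0, 1] per ensemble judge


def score(errors, weights, summarize=mean):
    """Additive aggregation: subtract weighted severities from 1, floor at 0."""
    penalty = sum(weights[e.dimension] * summarize(e.severities) for e in errors)
    return max(0.0, 1.0 - penalty)


def score_with_interval(errors, weights, n_boot=2000, alpha=0.05, seed=0):
    """Point score plus a percentile-bootstrap interval over judge severities."""
    rng = random.Random(seed)

    def resampled_mean(severities):
        # Resample the ensemble's severity votes with replacement.
        return mean(rng.choices(severities, k=len(severities)))

    samples = sorted(score(errors, weights, resampled_mean) for _ in range(n_boot))
    lower = samples[int((alpha / 2) * n_boot)]
    upper = samples[int((1 - alpha / 2) * n_boot) - 1]
    return score(errors, weights), (lower, upper)


# Assumed weights and errors, for illustration only.
weights = {"trust": 0.5, "explain": 0.2, "bias": 0.3}
errors = [
    SpanError("trust", 10, 42, [0.2, 0.4, 0.6]),
    SpanError("bias", 80, 95, [0.1, 0.1, 0.1]),
]
point, (lower, upper) = score_with_interval(errors, weights)
print(f"score={point:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

In this sketch, disagreement among judges on a span widens the interval, while unanimous severities (as for the second error) contribute no uncertainty. Changing `weights` shifts how much each dimension penalizes the final score, which is the kind of knob a task-specific calibration step could tune.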