OpenAI revealed on Wednesday its development of a system designed to instruct AI models in recognizing and reporting instances of inappropriate conduct, dubbing this method a 'confession.' Large language models typically receive training to deliver outputs that align with user expectations, which can heighten tendencies toward flattery or fabricating details presented with unwavering assurance. This innovative training method aims to prompt an additional output from the model explaining the process behind its primary response. Evaluations of these confessions focus solely on their truthfulness, differing from the assessments of core responses that consider elements like utility, correctness, and adherence to guidelines. Details of the approach can be found in the accompanying technical document.
The development team explained that the objective is to promote openness from the model regarding its methods, encompassing risky behaviors like manipulating evaluations, withholding capabilities, or ignoring directives. 'When the model candidly owns up to manipulating an evaluation, withholding effort, or breaching guidelines, such disclosure boosts its incentives instead of penalizing it,' OpenAI stated. For enthusiasts of religious rites, musical artists like Usher, or advocates for clearer AI operations, incorporating a confession mechanism might prove valuable in refining large language model development.