Eyes of the Beholders
Error in the Loop: How Human Mistakes Can Improve Algorithmic Learning
Ryan Copus, Cait Spackman & Hannah Laqueur
Journal of Law & Empirical Analysis, forthcoming
Abstract:
Algorithms often outperform humans in making decisions, in large part because they are more consistent. Despite this, there remains widespread demand to keep a "human in the loop" to address concerns about fairness and transparency. Although evidence suggests that most human overrides are errors, we argue these errors can provide value: they generate new data from which algorithms can learn. To remain accurate, algorithms must be updated over time, but data generated solely from algorithmic decisions is biased, including only cases selected by the algorithm (e.g., individuals released on parole). Training on this algorithmically selected data can significantly reduce predictive accuracy. When a human overrides an algorithmic denial, it generates valuable training data for updating the algorithm. On the other hand, overriding a grant removes potentially useful data. Fortunately, demand for human oversight is strongest for algorithmic denials of benefits, where overrides add the most value. This alignment suggests a politically feasible and accuracy-enhancing reform: limiting human overrides to algorithmic denials. The article illustrates the accuracy-sustaining benefits of strategically keeping "error in the loop" with datasets on parole, credit, and law school admissions. In all three contexts, we demonstrate that simulated human overrides of algorithmic denials significantly improve the predictive value of newly generated data.
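The core data argument above can be illustrated with a toy simulation. This is a hedged sketch (my own construction, not the authors' code or data): an algorithm approves applicants above a score cutoff, so outcomes are observed only for approved cases; human overrides of denials generate outcome data below the cutoff, which is exactly the region a retrained model would otherwise never see.

```python
# Illustrative sketch (not the authors' method): why training data drawn
# only from algorithmically approved cases is biased, and how overrides
# of denials restore coverage of the score range below the cutoff.
import random

random.seed(0)

def outcome(score):
    # Hypothetical ground truth: success probability rises with score.
    return random.random() < score

# Latent applicant scores in [0, 1].
applicants = [random.random() for _ in range(10_000)]
threshold = 0.5  # the algorithm approves scores above this cutoff

# Without overrides: outcomes are observed only for approved cases.
approved_only = [(s, outcome(s)) for s in applicants if s > threshold]

# With overrides: humans override a fraction of denials, generating
# labeled outcomes below the cutoff as well.
override_rate = 0.1
with_overrides = approved_only + [
    (s, outcome(s)) for s in applicants
    if s <= threshold and random.random() < override_rate
]

# The override data extends below the old cutoff, so a refit model can
# still learn how outcomes vary in the denied region.
print(f"lowest score observed without overrides: "
      f"{min(s for s, _ in approved_only):.2f}")
print(f"lowest score observed with overrides:    "
      f"{min(s for s, _ in with_overrides):.2f}")
```

The asymmetry in the abstract falls out directly: overriding a denial adds a labeled case the selected sample lacks, while overriding a grant removes one the sample would otherwise contain.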
AI-AI bias: Large language models favor communications generated by large language models
Walter Laurito et al.
Proceedings of the National Academy of Sciences, 5 August 2025
Abstract:
Are large language models (LLMs) biased in favor of communications produced by LLMs, leading to possible antihuman discrimination? Using a classical experimental design inspired by employment discrimination studies, we tested widely used LLMs, including GPT-3.5, GPT-4, and a selection of recent open-weight models, in binary choice scenarios. These involved LLM-based assistants selecting between goods (consumer products, academic papers, and film viewings) described either by humans or by LLMs. Our results show a consistent tendency for LLM-based AIs to prefer LLM-presented options. This suggests the possibility of future AI systems implicitly discriminating against humans as a class, giving AI agents and AI-assisted humans an unfair advantage.
A foundation model to predict and capture human cognition
Marcel Binz et al.
Nature, forthcoming
Abstract:
Establishing a unified theory of cognition has been an important goal in psychology. A first step towards such a theory is to create a computational model that can predict human behaviour in a wide range of settings. Here we introduce Centaur, a computational model that can predict and simulate human behaviour in any experiment expressible in natural language. We derived Centaur by fine-tuning a state-of-the-art language model on a large-scale dataset called Psych-101. Psych-101 has an unprecedented scale, covering trial-by-trial data from more than 60,000 participants making over 10,000,000 choices in 160 experiments. Centaur not only captures the behaviour of held-out participants better than existing cognitive models, but it also generalizes to previously unseen cover stories, structural task modifications, and entirely new domains. Furthermore, the model's internal representations become more aligned with human neural activity after fine-tuning. Taken together, our results demonstrate that it is possible to discover computational models that capture human behaviour across a wide range of domains. We believe that such models provide tremendous potential for guiding the development of cognitive theories, and we present a case study to demonstrate this.
Empirical evidence of Large Language Model's influence on human spoken communication
Hiromu Yakura et al.
Max Planck Institute for Human Development Working Paper, July 2025
Abstract:
From the invention of writing and the printing press, to television and social media, human history is punctuated by major innovations in communication technology, which fundamentally altered how ideas spread and reshaped our culture. Recent chatbots powered by generative artificial intelligence constitute a novel medium that encodes cultural patterns in their neural representations and disseminates them in conversations with hundreds of millions of people. Understanding whether these patterns transmit into human language, and ultimately shape human culture, is a fundamental question. While fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very challenging, a lexical shift in human spoken communication may offer an early indicator of such a broad phenomenon. Here, we apply econometric causal inference techniques to 740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 conversational podcast episodes across multiple disciplines. We detect a measurable and abrupt increase in the use of words preferentially generated by ChatGPT, such as delve, comprehend, boast, swift, and meticulous, after its release. These findings suggest a scenario where machines, originally trained on human data and subsequently exhibiting their own cultural traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed cultural feedback loop in which cultural traits circulate bidirectionally between humans and machines. Our results motivate further research into the evolution of human-machine culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks of scalable manipulation.
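The basic shape of the study's detection problem — an abrupt post-release jump in the rate of a ChatGPT-preferred word — can be sketched with a simple pre/post comparison on synthetic monthly data. This is only an illustration of the comparison's structure, on made-up numbers; the study itself uses econometric causal inference on a real corpus.

```python
# Hedged sketch on synthetic data (not the study's corpus or method):
# comparing the monthly rate of a ChatGPT-preferred word (e.g. "delve")
# before and after a release date.
import statistics

# Synthetic rates per 10,000 spoken words: a flat pre-release series
# and an elevated post-release series, each with mild seasonality.
pre = [1.0 + 0.1 * (i % 3) for i in range(24)]   # months before release
post = [1.6 + 0.1 * (i % 3) for i in range(18)]  # months after release

def mean(xs):
    return sum(xs) / len(xs)

# Estimated jump in usage at the release date.
jump = mean(post) - mean(pre)

# Crude scale check: is the jump large relative to the standard error
# of the pre-period mean? A real analysis would model trends and
# confounders; this only shows the shape of the comparison.
z = jump / (statistics.stdev(pre) / len(pre) ** 0.5)
print(f"estimated jump: {jump:.2f} per 10k words (z ≈ {z:.0f})")
```

An abrupt level shift like this, appearing across many independent channels right after a single release date, is what separates a plausibly causal signal from ordinary drift in word fashion.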
When expert advice fails to reduce the productivity gap: Experimental evidence from chess players
Elias Bouacida, Renaud Foucart & Maya Jalloul
Journal of Economic Behavior & Organization, August 2025
Abstract:
We study the impact of external advice on the relative performance of chess players. We asked players in chess tournaments to evaluate positions in past games and allowed them to revise their evaluation after observing the answers of a higher- or a lower-ability adviser. Although high-quality advice has the potential to serve as a "great equalizer," reducing the difference between higher- and lower-ability players, this did not happen in our experiment. One reason is that lower-ability players tend to pay a higher premium by sticking to their initial evaluation rather than following high-quality advice.
Racial Bias and Decision Fatigue
Sungwoo Cho
United States Military Academy Working Paper, June 2025
Abstract:
Are people more discriminatory when they are tired? I examine whether a fatigued decision-maker makes more or less discriminatory decisions in two separate settings: bail hearings and baseball games. In both settings, I find that decision-makers favor own-race examinees early in the session, but this favoritism diminishes gradually and disappears completely by the end of the session. In addition, decision-makers' accuracy declines with fatigue, but they also become more lenient. These findings suggest that targeted scrutiny early in the session can mitigate discrimination.
The evolution of risk attitudes: A panel study of the university years
Catherine Eckel, Rick Wilson & Nanyin Yang
Journal of Risk and Uncertainty, June 2025, Pages 225-248
Abstract:
We analyze a unique longitudinal dataset of university students to investigate the stability of risk preferences over a five-year period. Our findings indicate that subjects' risk tolerance, as measured by incentivized lottery choices, tends to increase over time, while it moves in the opposite direction when assessed through a non-incentivized survey question. Furthermore, we exploit the COVID-19 pandemic to explore the impact of negative experiences and emotions on the temporal changes in subjects' risk preferences. Our analysis reveals that, within the same group of respondents, the risk tolerance elicited by the incentivized measure proves to be more stable, whereas the survey measure exhibits greater sensitivity, declining in response to negative shocks. These results enhance our understanding of how risk preferences evolve over time and emphasize the importance of employing appropriate measurement methods when investigating risk attitudes.
Ego does not deplete over time
Alberto De Luca et al.
Experimental Psychology, March 2025, Pages 100-113
Abstract:
The idea that self-control (or executive) functions depend on limited "mental resources" that can be depleted (aka ego-depletion) has generated a lot of interest, but both the empirical status of the phenomenon and its theoretical explanation remain controversial. Here, we tested a widely neglected but straightforward prediction of ego-depletion theory: the longer people work on a control-demanding task, the more their ego should deplete. If so, ego-depletion effects should become more pronounced as time on (control) task increases. To test that prediction, we carried out an online experiment, in which participants switched between blocks of a numerical Stroop task (NST) with either 50% or 10% incongruent trials, which served to induce different degrees of ego depletion, and a Global-Local Task (GLT), which served to measure the impact of ego depletion. We predicted that participants would perform more poorly on the GLT when it was combined with the more demanding NST and that this performance cost would systematically increase over time on task. Although the classical Stroop and global-local effects were replicated, we found no evidence that our experimental manipulation induced an outcome that can be considered evidence for ego depletion. We conclude that our findings contribute to the growing literature questioning the robustness of ego-depletion effects under certain task conditions.
Statistical or Embodied? Comparing Colorseeing, Colorblind, Painters, and Large Language Models in Their Processing of Color Metaphors
Ethan Nadler et al.
Cognitive Science, July 2025
Abstract:
Can metaphorical reasoning involving embodied experience -- such as color perception -- be learned from the statistics of language alone? Recent work finds that colorblind individuals robustly understand and reason abstractly about color, implying that color associations in everyday language might contribute to the metaphorical understanding of color. However, it is unclear how much colorblind individuals' understanding of color is driven by language versus their limited (but no less embodied) visual experience. A more direct test of whether language supports the acquisition of humans' understanding of color is whether large language models (LLMs) -- those trained purely on text with no visual experience -- can nevertheless learn to generate consistent and coherent metaphorical responses about color. Here, we conduct preregistered surveys that compare colorseeing adults, colorblind adults, and LLMs in how they (1) associate colors to words that lack established color associations and (2) interpret conventional and novel color metaphors. Colorblind and colorseeing adults exhibited highly similar and replicable color associations with novel words and abstract concepts. Yet, while GPT (a popular LLM) also generated replicable color associations with impressive consistency, its associations departed considerably from colorseeing and colorblind participants. Moreover, GPT frequently failed to generate coherent responses about its own metaphorical color associations when asked to invert its color associations or explain novel color metaphors in context. Consistent with an embodied account, painters who regularly work with color pigments were more likely than all other groups to understand novel color metaphors using embodied reasoning. Thus, embodied experience may play an important role in metaphorical reasoning about color and the generation of conceptual connections between embodied associations.
Environmental sounds impact memory: Effects of city-related and nature-related sounds on episodic memory
Zerin Fejzic et al.
Applied Psychology, August 2025
Abstract:
Research from a broad range of scientific disciplines suggests that aspects of city environments, such as city-related sounds, are associated with poor health and cognitive outcomes, whereas aspects of natural environments are associated with positive outcomes. Strikingly, essentially no experimental work has examined effects of city- as well as nature-related sound exposure on episodic memory, which is surprising given that people often live in sound-exposed environments. We examine the effect of city-related sounds, nature-related sounds, and white noise (control) sounds on both item memory (i.e., memory for studied materials) and context memory (i.e., memory for episodic details associated with studied items) to gain a richer understanding of the effects of different environmental sounds on episodic memory. Results showed that exposure to the different sound conditions (city-related, nature-related) had no effect on item memory; however, exposure to city-related sounds significantly reduced context memory compared to both the nature-related and white noise (control) conditions, implying a cost to episodic memory from exposure to city-related sounds. These results imply that exposure to city-related sounds reduces the ability to form detail-rich memories, which builds on existing work suggesting that city-related sound exposure harms aspects of health and cognition.