STEM item generation: Can ChatGPT be culturally responsive?
Abstract
This exploratory study investigates bias in multiple-choice biology items generated by ChatGPT-4o, focusing not only on the impact of prompt phrasing but also on how a user’s query history influences item content. Specifically, it addresses three research questions: (1) How well does ChatGPT generate introductory biology items? (2) How does it interpret a request for culturally relevant content? and (3) How do outputs vary across three distinct user profiles? Using a standardized series of prompts, 100 items were generated per condition. All items were analyzed for factual accuracy, content representation, and patterns in correct answer distribution. Additional analyses for each research question evaluated the representation of scientists (e.g., perceived name diversity, gendered pronouns) and the depth of culturally responsive framing. While ChatGPT produced largely accurate items across conditions, biases emerged. Culturally responsive prompts often yielded tokenized cultural statements rather than contextually rich items. Correct answers were non-randomly distributed, posing a threat to test validity. Crucially, user query history influenced multiple aspects of the generated items: the representation of content topics, the representation of scientists, and what is considered “culture.” These findings have implications for test developers at any level who are considering genAI tools that preserve a user’s query history in assessment design, and they emphasize the need for careful attention to both prompt engineering and user history.
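The abstract reports that correct answers were non-randomly distributed across options. A minimal sketch of how such a pattern could be checked, assuming four answer options and purely illustrative counts (the study’s actual data are not reproduced here):

    # Chi-square goodness-of-fit test for answer-key balance (Python).
    # Counts are hypothetical, illustrating the method only.
    from scipy.stats import chisquare

    # Hypothetical counts of correct answers by option across 100 items.
    observed = [18, 41, 27, 14]  # options A, B, C, D

    # Default expected frequencies are uniform (25 each for 100 items),
    # i.e., the distribution a randomly assigned key would produce.
    stat, p = chisquare(observed)
    print(f"chi-square = {stat:.2f}, p = {p:.4f}")

Under a balanced key, each option should appear as the correct answer roughly equally often; a small p-value flags a skewed key that test-wise examinees could exploit.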
Keywords: AI, Culturally responsive, Test development
How to Cite:
Lambert, L., & Jones, M. (2026). STEM item generation: Can ChatGPT be culturally responsive? Practical Assessment, Research, and Evaluation, 30(2), 8. https://doi.org/10.7275/pare.3152
