At Enkefalos Technologies, we believe research should translate into real impact: we do research to solve problems that matter in the real world.
Our clients operate in regulated, high-risk industries (insurance, finance, public safety).
These domains need trustworthy AI that can reason, infer, and adapt — not just autocomplete.
Generic LLMs are fragile and verbose. We’re fixing that by pushing the limits of model reasoning.
Each paper informs a product, whether it’s our InsurancGPT copilot, custom GenAI solutions, or low-resource language models.
Abstract
Language models have advanced significantly in their ability to generate coherent text and predict subsequent tokens from a given prompt. This study systematically compares the next-token prediction performance of two widely recognized models, OpenAI’s GPT-2 and Meta’s Llama-2-7b-chat-hf, on Theory of Mind (ToM) tasks. To rigorously assess their capabilities, we constructed a diverse dataset from 10 short stories sourced from the Explore ToM Dataset. We enhanced these stories by programmatically inserting additional sentences (referred to as infills) using GPT-4, creating multiple variations that introduce varying levels of contextual complexity. This approach allows us to examine how increasing context influences model performance. We evaluate model behavior under different temperature settings (0.01, 0.5, 1.0, and 2.0) and test the models’ ability to predict the next token across three distinct reasoning levels. Zero-order reasoning involves state tracking, which may probe either the current state (ground truth) or prior states (memory). First-order reasoning refers to understanding someone’s mental state (e.g., “Does Anne know the apple is salted?”). Second-order reasoning introduces an additional level of recursion in mental-state tracking (e.g., “Does Anne think that Charles knows the apple is salted?”). Our findings reveal that increasing the number of infill sentences slightly reduces prediction accuracy, as the added context introduces complexity and ambiguity. Llama-2 consistently outperforms GPT-2 in accuracy, particularly at lower temperatures, where it exhibits higher confidence in selecting the most probable next token. As question complexity increases, model responses diverge significantly: both GPT-2 and Llama-2 exhibit greater response diversity on first- and second-order reasoning tasks. These insights highlight how model architecture, temperature, and context affect next-token prediction, deepening our understanding of language model capabilities and limitations.
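For readers who want to see the mechanics, here is a minimal sketch of the next-token comparison, assuming the Hugging Face transformers library and PyTorch. The probe sentence, the top-k inspection, and the helper name next_token_candidates are illustrative assumptions, not the paper's exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Temperatures evaluated in the study.
TEMPERATURES = [0.01, 0.5, 1.0, 2.0]

# Illustrative first-order ToM probe; the paper's stories come from the
# Explore ToM Dataset and are not reproduced here.
PROMPT = "Anne left before Charles salted the apple. Anne believes the apple is"

def next_token_candidates(model, tokenizer, prompt, temperature, top_k=5):
    """Return the top-k next-token candidates with temperature-scaled probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the final position
    # Temperature scaling: T < 1 sharpens the distribution (near-argmax
    # behavior), while T > 1 flattens it, admitting more diverse candidates.
    probs = torch.softmax(logits / temperature, dim=-1)
    top = torch.topk(probs, top_k)
    return [(tokenizer.decode([int(i)]), float(p))
            for i, p in zip(top.indices, top.values)]

# "gpt2" is openly available; the Llama-2 checkpoint is gated on the Hub
# and requires accepting Meta's license before download.
for name in ["gpt2", "meta-llama/Llama-2-7b-chat-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    for t in TEMPERATURES:
        print(name, t, next_token_candidates(model, tokenizer, PROMPT, t))
```

At T = 0.01 the scaled softmax is nearly an argmax, which is consistent with the abstract's observation that Llama-2 shows higher confidence in its top candidate at lower temperatures.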