At Enkefalos Technologies, we believe in research that translates into real impact.
We do research to solve problems that matter in the real world.
Our clients operate in regulated, high-risk industries (insurance, finance, public safety).
These domains need trustworthy AI that can reason, infer, and adapt — not just autocomplete.
Generic LLMs are fragile and verbose. We’re fixing that by pushing the limits of model reasoning.
Each paper informs a product, whether it’s our InsurancGPT, custom GenAI solutions, or low-resource language models.
Abstract
Recent advancements in Large Language Models (LLMs) have sparked interest in their structured reasoning capabilities, particularly in abstraction and pattern recognition tasks. The Abstraction and Reasoning Corpus (ARC) benchmark serves as a key evaluation tool for assessing AI models’ ability to generalize and solve novel reasoning tasks. While GPT-4o successfully solves all ARC tasks at zero noise, models such as DeepSeek R1 and LLaMA 3.2 fail to solve any, raising questions about their abstraction and generalization capabilities beyond pattern matching. To investigate this further, we evaluate these models under varying noise levels and temperature settings. Our findings indicate that introducing noise significantly degrades performance across all models, underscoring their fragility under uncertain conditions. This suggests that while some models demonstrate reasoning abilities, they remain highly sensitive to input perturbations, limiting their robustness. By analyzing how different architectures handle noise and uncertainty, we provide insights into the limitations of current AI systems in structured reasoning. Our study highlights the need for more resilient AI models that can adapt to real-world complexity, informing future research on improving generalization, robustness, and alignment with human cognitive flexibility.
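The abstract evaluates models under varying noise levels on ARC grids. As a minimal sketch of what such noise injection could look like (the function name `add_noise` and the cell-flipping scheme are our assumptions, not the paper's stated method), one might perturb a fraction of cells in an ARC-style grid of colors 0-9:

```python
import random

def add_noise(grid, noise_level, rng=None):
    """Return a copy of an ARC-style grid (list of lists of ints 0-9)
    with a `noise_level` fraction of cells replaced by a different
    random color. Hypothetical helper for illustration only."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    n_flip = round(noise_level * len(cells))
    noisy = [row[:] for row in grid]
    for r, c in rng.sample(cells, n_flip):
        # pick any color except the current one, so the cell truly changes
        noisy[r][c] = rng.choice([v for v in range(10) if v != noisy[r][c]])
    return noisy

grid = [[0, 1], [2, 3]]
print(add_noise(grid, 0.5))  # half of the cells perturbed
```

At `noise_level=0.0` the grid is returned unchanged, giving the zero-noise baseline the abstract refers to; increasing the level degrades the input in controlled steps.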