Reasoning apps: the next frontier for LLMs

Large language models have proven to be an excellent tool for creative work — primarily, hallucinating unlimited first drafts. However, we've barely scratched the surface of how these models can be used as reasoning agents. This is where the most significant opportunities lie.

Reasoning is all you need

Beyond hallucinating first drafts, language models are excellent at reasoning tasks. This finding was a surprise to researchers. It turns out reasoning is an emergent property of huge language models. Past a sufficiently large scale, these models perform significantly better than random on a range of reasoning tasks. Naturally, several researchers proceeded to test this rigorously. Could a model perform reasonably well on arithmetic tasks? Yes. Could a model identify the meaning of a word in context? Yes. Could a model do well on college-level exams in biology? Also, yes!

[Figure: Notice how performance jumps dramatically after a specific scale. Source.]

Many other tasks have been and continue to be tested.

The results are striking. But what I find more compelling is what these findings suggest for future user-facing applications.

We can think of most SaaS apps as systems of record whose features can be broken down into a set of reasoning tasks. Perhaps we can rebuild the next generation of user-facing applications as a set of interactions between reasoning primitives like classification, summarization, document processing, planning, and data transformation. I call this emerging class of apps reasoning apps.
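
To make the idea of primitives a little more concrete, here is a minimal Python sketch of one feature expressed as a composition of two of them. Everything in it is an assumption of mine for illustration: the `Primitive` type, the `file_expense` feature, and the keyword and first-sentence rules that stand in for real model calls.

```python
from typing import Callable, Dict

# Hypothetical convention: a primitive is just "text in, text out".
# In a real app each one would wrap a call to a language model.
Primitive = Callable[[str], str]


def classify(text: str) -> str:
    # Stand-in for an LLM classifier: a single keyword rule.
    return "travel" if "flight" in text.lower() else "other"


def summarize(text: str) -> str:
    # Stand-in for an LLM summarizer: keep only the first sentence.
    return text.split(".")[0].strip() + "."


PRIMITIVES: Dict[str, Primitive] = {
    "classification": classify,
    "summarization": summarize,
}


def file_expense(note: str) -> Dict[str, str]:
    """A made-up SaaS feature built by composing reasoning primitives."""
    return {
        "category": PRIMITIVES["classification"](note),
        "summary": PRIMITIVES["summarization"](note),
        "raw": note,
    }


record = file_expense("Flight to Berlin for the conference. Receipt attached.")
```

The point of the sketch is the shape, not the rules: the feature itself is a thin layer of glue, and swapping a stub for a real model call would not change how the pieces fit together.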

The anatomy of a reasoning app

I'm not a large language model (or am I?) but allow me to hallucinate a sketch of what a reasoning app could look like. We can think of reasoning apps as classic event-response systems. The difference is that large language models allow us to embed reasoning directly into the user flow. Below is a conceptual sketch of one possible general architecture.

Events can be user-triggered or automated. The core reasoning step is a call to a language model that can be augmented by other sources of knowledge or computation. Finally, the response step could be native to the user interface or external to some other system and executed via an API.
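
The event, reasoning, and response steps described above can be sketched in a few lines of Python. This is only one possible reading of the architecture, under my own assumptions: `Event`, `Response`, `reason`, `respond`, and `handle` are names I made up, and `stub_model` is a placeholder where a real language model call would go.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    source: str   # "user" (user-triggered) or "automated"
    payload: str  # the text the app needs to reason about


@dataclass
class Response:
    target: str   # "ui" (native to the interface) or "api" (external system)
    body: str


def stub_model(prompt: str) -> str:
    # Placeholder for a real LLM call; labels a ticket with a keyword rule.
    return "billing" if "invoice" in prompt.lower() else "general"


def reason(event: Event, model: Callable[[str], str], context: str = "") -> str:
    # Core reasoning step: `context` is where extra knowledge or the
    # result of another computation could be added to the prompt.
    prompt = f"{context}\nClassify this support ticket:\n{event.payload}"
    return model(prompt)


def respond(event: Event, label: str) -> Response:
    # In this sketch, user-triggered events answer in the UI and
    # automated events are forwarded to an external API.
    target = "ui" if event.source == "user" else "api"
    return Response(target=target, body=f"routed to {label}")


def handle(event: Event, model: Callable[[str], str] = stub_model,
           context: str = "") -> Response:
    return respond(event, reason(event, model, context))


user_result = handle(Event("user", "My invoice is wrong"))
auto_result = handle(Event("automated", "Password reset requested"))
```

Passing the model in as a plain callable is a deliberate choice in this sketch: it keeps the reasoning step swappable and easy to test without a network call.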

The idea that reasoning should sit directly in the user flow is worth underscoring. A design native to the user experience has a lower cognitive burden. The most straightforward way to accomplish this for conversational tasks like customer support is to build a chatbot. However, most reasoning tasks are not conversational tasks. Instead, they are end-to-end workflow tasks built around reasoning primitives. Therefore, the ideal interface for most other tasks probably looks like a workflow tool with an explicit step executed by an LLM in context.

I am particularly excited to see projects like LangChain, Dust, GPT Index, and Cognosis working on the building blocks for this vision.

Parting thoughts: accuracy vs utility, domain-specific reasoners

A popular critique of LLMs is that their reasoning is not as accurate as that of experts. The expert bar is too harsh: expertise is not a prerequisite for utility. A medical student might not have the same expertise as a doctor, but that does not mean they are useless in the field of medicine. What matters is the task they are being asked to complete and the constraints put in place to ensure they do no harm.

Similarly, scoped utility is table stakes for designing reasoning apps. A well-designed reasoning app will be scoped to a particular task and have the appropriate safeguards. For human-in-the-loop workflows, the human is the safeguard. For fully autonomous processes, real-time observability should exist.
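
The two kinds of safeguard can be sketched as a single gate around the model's output. This is a rough illustration under my own assumptions: `run_step` and its `approve` callback are hypothetical names, and a standard-library log line stands in for whatever real-time observability a production system would use.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("reasoning_app")


def run_step(task: str, model_output: str,
             approve: Optional[Callable[[str], bool]] = None) -> Optional[str]:
    """Apply a safeguard before a model output is allowed to take effect.

    If `approve` is given, a human (or a proxy for one) must accept the
    output. Otherwise the step runs autonomously and is logged so it can
    be observed.
    """
    if approve is not None:
        # Human-in-the-loop: the reviewer is the safeguard.
        return model_output if approve(model_output) else None
    # Fully autonomous: emit a record that monitoring can pick up.
    logger.info("autonomous step task=%s output=%s", task, model_output)
    return model_output


rejected = run_step("summarize", "draft summary", approve=lambda out: False)
accepted = run_step("summarize", "draft summary", approve=lambda out: True)
autonomous = run_step("summarize", "draft summary")
```

A rejected output returns `None` here, so the caller has to decide explicitly what happens next rather than silently acting on something nobody approved.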

We’re just at the beginning of understanding what LLMs can do. The low-hanging fruit applications will likely look like domain-specific reasoners for specialized knowledge workers (engineers, lawyers, doctors) and professional consumers. These apps will focus on automating manual processes that a human expert can quickly audit.

Welcome to a world where everyone gets their own high-value, low-maintenance intern.