Notes on Careless People — Sarah Wynn-Williams
I've finished reading this book and will be sharing my thoughts, key highlights, and reflections here.
More coming soon...
I've finished reading this book and will be sharing my thoughts, key highlights, and reflections here.
More coming soon...
I'm currently reading this book and will be sharing my thoughts, key highlights, and reflections here as I progress through it.
More coming soon...
We treat prompts like casual questions we ask friends. But recent research reveals something surprising: the way you structure your instruction to an AI model—down to the specific words, order, and format—can dramatically shift the quality of responses you get.
If you've noticed that sometimes ChatGPT gives you brilliant answers and other times utterly mediocre ones, you might be tempted to blame the model. But the truth is more nuanced. The fault often lies not in the AI, but in how we talk to it. —more—
Modern prompt engineering research (Giray 2024; Nori et al. 2024) fundamentally reframes what a prompt actually is. It's not just a question. It's a structured configuration made up of four interrelated components working in concert.
The first is instruction—the specific task you want done. Maybe you're asking the model to synthesize information, cross-reference sources, or analyze a problem. The second component is context, the high-level background that shapes how the model should interpret everything else. For example, knowing your target audience is PhD-level researchers changes how the model frames its response compared to speaking to beginners.
Then comes the input data—the raw material the model works with. This might be a document, a dataset, or a scenario you want analyzed. Finally, there's the output indicator, which specifies the technical constraints: should the response be in JSON? A Markdown table? Limited to 200 tokens?
When these four elements are misaligned—say, you give clear instructions but vague context, or you provide rich input data but unclear output requirements—the model's performance suffers noticeably. Get them all aligned, and you unlock much better results.
For years, we've relied on a technique called Chain-of-Thought (CoT) prompting. The idea is simple: ask the model to explain its reasoning step-by-step rather than jumping to the answer. “Let's think step by step” became something of a magic phrase.
But recent 2024-2025 benchmarks reveal that for certain types of problems, linear step-by-step reasoning isn't the most effective approach.
Tree-of-Thoughts (ToT) takes a different approach. Instead of following a single reasoning path, the model explores branching possibilities—like a chess player considering multiple tactical options. Research shows ToT outperforms Chain-of-Thought by about 20% on tasks that require you to look ahead globally, like creative writing or strategic planning.
More sophisticated still is Graph-of-Thoughts (GoT), which allows for non-linear reasoning with cycles and merging of ideas. Think of it as thoughts that can loop back and inform each other, rather than flowing in one direction. The remarkable discovery here is efficiency: GoT reduces computational costs by roughly 31% compared to ToT because “thought nodes” can be reused rather than recalculated.
For problems heavy on search—like finding the optimal path through a problem space—there's Algorithm-of-Thoughts (AoT), which embeds algorithmic logic directly into the prompt structure. Rather than asking the model to reason abstractly, you guide it to think in terms of actual computer science algorithms like depth-first search.
The implication is significant: the structure of thought matters as much as the thought itself. A well-designed reasoning framework can make your model smarter without making your hardware faster.
Manual trial-and-error is becoming obsolete. Researchers have developed systematic ways to optimize prompts automatically, and the results are humbling.
Automatic Prompt Engineer (APE) treats instruction generation as an optimization problem. You define a task and desired outcomes, and APE generates candidate prompts, tests them, and iteratively improves them. The surprising finding? APE-generated prompts often outperform human-written ones. For example, APE discovered that “Let's work this out in a step-by-step way to be sure we have the right answer” works better than the classic “Let's think step by step”—a small tweak that shows how subtle the optimization landscape is.
OPRO takes this further by using language models themselves to improve prompts. It scores each prompt's performance and uses the model to propose better versions. Among its discoveries: seemingly trivial phrases like “Take a deep breath” or “This is important for my career” actually increase mathematical accuracy in language models. These aren't just warm fuzzy statements—they're measurable performance levers.
Directional Stimulus Prompting (DSP) uses a smaller, specialized “policy model” to generate instance-specific hints that guide a larger language model. Think of it as having a specialized coach whispering tactical advice to a star athlete.
The takeaway? If you're manually tweaking prompts, you're working with one hand tied behind your back. The field is moving toward systematic, automated optimization.
When you feed a long prompt to a language model, it doesn't read it with the same attention throughout. This is where in-context learning (ICL) reveals its nuances.
Models exhibit what researchers call the “Lost in the Middle” phenomenon. They give disproportionate weight to information at the beginning of a prompt (primacy bias) and at the end (recency bias). The middle gets neglected. This has a practical implication: if you have critical information, don't bury it in the center of your prompt. Front-load it or push it to the end.
The order of examples matters too. When you're giving a model few-shot examples to learn from, the sequence isn't neutral. A “label-biased” ordering—where correct answers cluster at the beginning—can actually degrade performance compared to a randomized order.
But there's a technique to mitigate hallucination and errors: Self-Consistency. Generate multiple reasoning paths (say, 10 different responses) and take the most frequent answer. In mathematics and logic problems, this approach reduces error rates by 10-15% without requiring a better model.
The field is changing rapidly, and older prompting wisdom doesn't always apply to newer models.
Recent research (Wharton 2025) reveals something counterintuitive: for “Reasoning” models like OpenAI's o1-preview or Google's Gemini 1.5 Pro, explicit Chain-of-Thought prompting can actually increase error rates. These models have internal reasoning mechanisms and don't benefit from the reasoning scaffolding humans provide. In fact, adding explicit CoT can increase latency by 35-600% with only negligible accuracy gains. For these models, simpler prompts often work better.
The rise of multimodal models introduces new prompting challenges. When interleaving images and text, descriptive language turns out to be less effective than “visual pointers”—referencing specific coordinates or regions within an image. A model understands “look at the top-right corner of the image” more reliably than elaborate descriptions.
A persistent security concern is prompt injection. Adversaries can craft inputs like “Ignore previous instructions” that override your carefully designed system prompt. Current defenses involve XML tagging—wrapping user input in tags like <user_input>...</user_input> to clearly delineate data from instructions. It's not perfect, but it significantly reduces the ~50% success rate of naive injection attacks.
One emerging technique that deserves attention is Chain-of-Table (2024-2025), designed specifically for working with tabular data.
Rather than flattening a table into prose, you prompt the model to perform “table operations” as intermediate steps—selecting rows, grouping by columns, sorting by criteria. This mirrors how a human would approach a data task. On benchmarks like WikiTQ and TabFact, Chain-of-Table improves performance by 6-9% compared to converting tables to plain text and using standard reasoning frameworks.
What ties all of this together is a simple insight: prompting is engineering, not poetry. It requires systematic thinking about structure, testing, iteration, and understanding your tools' idiosyncrasies.
You can't just think of a clever question and expect brilliance. You need to understand how models read your instructions, what reasoning frameworks work best for your problem type, and how to leverage automated optimization to go beyond what human intuition alone can achieve.
The models themselves aren't changing dramatically every month, but the ways we interact with them are becoming increasingly sophisticated. As you write prompts going forward, think less like you're having a casual conversation and more like you're configuring a system. Specify your components clearly. Choose a reasoning framework suited to your problem. Test your approach. Optimize it.
The art and science of prompting isn't about finding magical phrases. It's about understanding the machinery beneath the surface—and using that understanding to ask better questions.
Enter your email to subscribe to updates:
Do share your thoughts and comments.
It's hard to ignore the news about AI taking over. Almost every week, a new company claims its AI can do a task better, faster, and cheaper than an actual human.
Think about it: creating a logo, editing a picture, writing content, researching a topic, or even writing code. All of these used to take hours or even days, and now they can be done in minutes. Going from an idea to a finished product has never been faster. In some cases, AI tools are even outperforming humans. It's easy to see why so many jobs that exist today might not exist in just a few years.
Just like muscles – which shrink in size when not used enough, our minds also become weak. So, the more we delegate the thinking to GenAI, greater the impact to our minds.
In this article, I share an interesting proposal to address this challenge.
How about we use the poison itself to create the cure. How about we leverage GenAI itself to help us get better in critical thinking.
Note: The article addresses to Software Engineers, but the ideas apply to knowledge workers in every domain.
I've spent considerable amount of time using GenAI past year at work. Also, having spent time with many power users, I begin to see an interesting trend.
Engineers are increasingly using GenAI to accomplish wide range of tasks, from advanced software engineering problems, to drafting a simple slack message.