Context Engineering: Introduction and Techniques When we think about designing an AI agent system, we spend hours scrutinising model capabilities, trying to limit their weak points through prompt engineering techniques and seeking to reduce hallucinations and erroneous behaviour as much as possible. Yet we often overlook a crucial aspect that depends heavily on our design and architecture: context.

What Is Context in the World of Language Models?

Context is everything the model sees as input on each inference (messages, instructions, and injected data) that conditions its output. This information is usually composed of various sources, such as:

  • System prompt: high-level instructions, not visible to the user, that define behaviour, policies, style, and procedures. They usually take priority over other sources and developers can add intermediate instruction layers on top of those set by the owning company. It tends to be very specific and detailed.

  • User messages and attachments (text, images, audio, video, documents), limited by the model’s capabilities and context window.

  • Memory: information existing in the current conversation, previous conversations, or user preferences. It only becomes part of the context when it is explicitly re-injected into the model call.

  • Prompt enrichment: external information (public or private; structured, semi-structured, or unstructured) added to guide and improve the response. Like the system prompt, it is usually invisible to the user. Depending on how it is retrieved, it can be:
    • Externally provided information: a system external to the model retrieves, filters, and transforms relevant content and injects it into the prompt. Sources can be internal (tickets, manuals, wikis, knowledge bases, code, etc.) or public (websites, papers, FAQs). Retrieved from structured data (SQL, tables), semi-structured (JSON, XML, HTML, CSV), or unstructured (free text, PDFs). Lexical techniques (BM25, TF-IDF) and semantic techniques (embeddings and vector databases) are used, often in hybrid fashion. This is the typical RAG (Retrieval-Augmented Generation) architecture.
    • Information autonomously retrieved by the model (tool calling): during inference, the model decides to call external tools and re-injects their results into the context. It can use business APIs, databases, web search engines, file stores, and transformation utilities (parsers, converters, table extractors, etc.). In this domain, the Model Context Protocol (MCP) standardises how the model discovers and uses external tools and resources.
  • Chains of thought (chain-of-thought): although not part of the context by default, they guide the model’s internal reasoning. They only become context if they are exposed and explicitly re-injected (for example, as a scratchpad in multi-step pipelines).

Limitations of Context Windows

Although models are trained on trillions of tokens, which ultimately translate into billions of neural network parameters, the context window is considerably more limited. We must always bear in mind that the context window is finite and that it pays to synthesise and prioritise information in brief, concise, unambiguous messages. This is due to several reasons:

  • Computational cost or complexity: the attention mechanism of transformers has an algorithmic complexity of O(n²), meaning that the computational cost grows quadratically with the size of the context.
  • Saturation of the attention window: although current models can have context windows of up to 2 million tokens, in practice they are not able to retrieve information consistently and accurately in such large context windows. Benchmarks such as Needle-in-a-Haystack NIAH (inserts a “needle” into a long “haystack” and measures whether the model retrieves it depending on position and length), RULER (arXiv) (estimates the “effective context” with retrieval, multihop, aggregation, and QA tasks, varying context length), and LongBench · ∞Bench (multi-task evaluation with extended contexts and distractors) show that the “effective context” is usually considerably smaller and depends on the position of the evidence, the noise, and the task itself. In this regard, the “lost in the middle” phenomenon has already been demonstrated: LLMs tend to focus attention at the beginning and end of the context, degrading in the middle section.
  • It increases the probability of introducing contradictory, ambiguous, or misleading information that gives rise to inconsistent, erroneous, or hallucinated responses.

What Is Context Engineering and How Does It Differ from Prompt Engineering?

Both are techniques of refactoring, enrichment, and pattern alteration to achieve a better outcome. While prompt engineering focuses mainly on system prompt techniques and user messages, context engineering focuses on what minimal, high-value information enters the context window at each step with the goal of maximising response quality and agent robustness.

A Constant Evolution: From RAG to MCPs

Enriching the prompt has evolved in layers, and today different approaches coexist that complement rather than replace each other. From the beginnings to the present:

  • Prompt engineering: the early days. Initially, the most common technique was playing with the system prompt and user messages to achieve the desired behaviour. However, limitations such as the knowledge cutoff (the model’s knowledge cut-off date) and the inability to use specific information tailored to particular use cases remained a problem. Even today, a good hardcoded system prompt remains the foundation for fixing behaviour, style, and limits. It is advisable to include behaviour examples (well chosen, representative, and not excessive) to anchor the model without introducing noise or fragility.

  • RAG: the first major wave. Retrieval-Augmented Generation consists of introducing relevant fragments (chunks) from documentation (private or public) that has previously been indexed in embeddings stored in vector databases, to constrain the response to the provided corpus. It represented a gain in control and up-to-date knowledge without needing to retrain the model, and is usually faster than reading complete documents.

  • Rerankers and (omni)multimodality: refining RAG. Rerankers improve fragment selection with a second filter (often cross-encoder models; in other cases, an LLM) to prioritise query adherence and avoid incoherence. Omnimodality/multimodality enables information retrieval from text, images, audio, or video, expanding the capabilities of existing systems.

  • Wider context windows and falling token costs: a paradigm shift. This advance made it possible to pass complete documents, without having to break them into chunks, when preserving global structure and coherence is key. Even so, more context means higher computational cost and risk of “lost in the middle”, so it is worth evaluating the cost/benefit.

  • Agents with tools: when agents started using tools to search and filter autonomously, we moved to an even more selective approach. Metadata (titles, paths, types, timestamps, tags, and permissions) are decisive for finding what is relevant; the model itself decides, after selective reading (e.g., partial views, searches, summaries), whether to continue reading, look for new information, or discard parts, mitigating the biases of chunks.

  • MCPs: the next level of integration. With Model Context Protocol (MCP), the definition of functions/methods and descriptions lives outside the context (in the server) as a tool catalogue; the LLM discovers and chooses what to use at runtime. This simplifies maintenance, reduces tokens, and favours more robust orchestration between agent and services.

Conclusion: there is no substitution, only complementarity. A clear system prompt is necessary; RAG provides control and freshness; rerankers and multimodality refine relevance; long contexts preserve coherence; tools/MCP enable more autonomous agents. The key to context engineering is combining these pieces according to the task, the token budget, and the precision and latency requirements.

Some Context Engineering Techniques

To close this article, I would like to share some context engineering techniques with you. Some are inspired by this article from Anthropic, which I highly recommend reading, and others are personal, based on experience or other readings.

  1. Recommendations for RAG systems:
    • Query reformulation: On the query the user has formulated, it is generally advisable to launch several parallel queries with different reformulations or alternatives to improve the vector search in fragment retrieval. Using one or several LLMs to do this independently usually improves results over the user’s direct query. Moreover, technically rephrasing the user’s question using domain jargon tends to be more effective. Here, combining lexical techniques (BM25, TF-IDF) in hybrid approaches where the first filter is lexical and the second semantic can be interesting.
    • Weighting and condensing retrieved information: introducing a reranker system that selects the most appropriate fragments is a good technique when we introduce reformulations of the user’s initial query. This ensures we choose the fragments most closely related to the original question, whether through a voting system or a score. Another good practice can be placing an agent that, among the chosen fragments, produces a summary or consolidates the information to feed the context of the answering agent more cleanly, avoiding incoherence and diffuse information. It is especially useful when there are multiple sources and we do not want to overload the main agent’s context.
  2. “Garbage collector” or compaction: this refers to the action of cleaning the context at each interaction, or set of interactions, to remove irrelevant information, focus on particular matters, and maintain consistency across long conversations or processes. It consists of asking the LLM for a summary of all information incorporated into the context and consolidating that summary with the additional information provided at each new iteration, using this summary as the new context for the next step instead of all the information previously provided. Being more concise, it optimises token usage, is faster, and more coherent.

  3. Structured external memory: Instead of including vast amounts of detailed information in an oversized system prompt, this involves using external storage files (in Markdown format preferably) to compile summaries, notes, processes, or results developed during the interaction, with the aim of loading them ad hoc dynamically into the context. In this way, with information compartmentalised, we avoid incoherence, save context, and introduce more specific behaviours for each scenario. It can also serve as the agent’s storage system to note progress or results without loading the context, retrieving them more selectively. It is important that the system maintains and updates a root file with the up-to-date structure, existing files, and a description of what each contains in order to know which one to load at each moment. For those experienced in this area, it is a system similar to Cursor’s rules or the CLAUDE.md file in Claude Code.

  4. Hierarchical orchestration of subagents with more bounded and specialised contexts. Instead of condensing all information in a single agent, use an agent system to distribute and fragment context among them, so that each one has information about the same aspect but in a reduced size within its window.

  5. Council of wisdom: in some scenarios it can be beneficial to introduce several agents that are forced to search for information or obtain results under different techniques and approaches, channelling their viewpoints into a common space and cross-checking information, initiating refinement loops that lead to consensus responses which are then provided to the agent responsible for responding or continuing the process. The decision can be collective (though this can lead to infinite loops and excessive consumption) or delegated to a reviewer without connection between agents.

  6. Providing tools for autonomous information retrieval. Rather than relying on a deterministic programmed system, letting the LLM itself decide when to search, where, and how, in accordance with prior instructions. Here, MCPs are important for achieving access to external services where relevant information can be found without having to think in very closed flows. The fact that the system itself can carry out SQL queries, graph network or even vector queries, or API calls, relieves information overload and allows maintaining a continuous, coherent thread of actions. Other tools, such as code execution or terminal access, make it easier to have verified, tested information with which to guarantee a more precise response. However, it is often advisable to omit internet access from the agentic system to avoid contradictory or contaminated information, or in scenarios where internal information must prevail. As a tip, one alternative could be to control or restrict access by providing the system with lists of authorised information sources known to be safe and not in conflict with internal data (whitelists).

  7. Using the right model type for each task: reasoning models can be useful for decision-making because, by having multiple lines of thought, they can cross-check and refine information. But they are not always advisable. For simple summaries or extractions, non-reasoning models are much faster, cheaper, and do not incur the risk of over-adjustment. It is also important to assess model capabilities and create a structure that makes an appropriate mix between the strengths and weaknesses of each one. Adjust parameters to achieve greater determinism or creativity (temperature, top_p) and for more detailed or concise responses (max_tokens, verbosity). In other cases, traditional Machine Learning solutions may be more effective, introduce less latency, and be cheaper than an LLM. There is no need to use a sledgehammer to crack a nut.

I hope you found this content informative and useful. If so, I encourage you to share it and comment on it. I am always eager to engage in conversations with other enthusiasts on these topics. Thank you for reading!