Week 7: Ultra-Long Context Processing and Efficient Inference

Week 7: Ultra-Long Context Processing and Efficient Inference#

1. Paradigm Shift in Context Windows#

Over the past few years, the field of Natural Language Processing (NLP) has undergone dramatic changes driven by the advancement of Large Language Models (LLMs). At the center of this development lies the expansion of the ‘context window’—the amount of information a model can process and reference at once. As of 2025, we have entered an era of ‘ultra-long context revolution’ that goes beyond mere incremental improvements to redefine how LLMs are utilized. This lecture will explore in depth the core technologies driving this revolution, the latest flagship models, and the new paradigms and practical challenges that emerge from this transformation.

1.1 From Kilobytes to Megabytes – Quantitative Leap in Context#

In the early stages of LLM development, the context window was one of the model’s biggest constraints. Models from 2018 and 2019 had maximum context sizes of only 512 and 1,024 tokens respectively. This meant that the information a model could reference at once was limited to a few paragraphs, showing clear limitations in understanding long conversations or complex documents.

However, by 2024 and into 2025, these limitations have been dramatically overcome. Google’s Gemini and other latest models have begun offering context windows of hundreds of thousands to over 1 million tokens, suggesting that LLMs’ ‘working memory’ has expanded to the level of a book, or even a small library. To put 1 million tokens in perspective:

Approximately 50,000 lines of code (assuming 80 characters per line)
8 average-length English novels
Scripts from over 200 average-length podcast episodes

Furthermore, Meta’s Llama 4 handles 10 million tokens, while innovative models like Magic’s LTM-2-Mini demonstrate the ability to process an astonishing 100 million tokens (equivalent to 10 million lines of code), showing that the pace of technological advancement exceeds our imagination. This explosive growth in context windows is fundamentally changing how LLMs process information. While it was important to ‘compress’ knowledge within model parameters in the past, the core capability now lies in directly providing vast amounts of information within the context and enabling the model to search and reason through that information in real-time. In other words, the model’s role is transitioning from a knowledge repository to a ‘context-based information processing and reasoning engine’.

1.2 Capabilities of 2025 Flagship Models#

As of 2025, various technology companies are competitively launching flagship LLMs that support ultra-long context, leading the technological frontier. Representative models and their characteristics are as follows:

OpenAI GPT-5: A model that achieves a dramatic leap in intelligence beyond the previous generation GPT-4o, featuring a dedicated ‘reasoning’ module for solving complex problems. It demonstrates cutting-edge performance across various domains including coding, mathematics, and writing, with enhanced multimodal processing capabilities.
Google Gemini 2.5 Pro: A model that embodies the concept of a ‘thinking model,’ equipped with the ability to improve accuracy through internal reasoning processes before generating responses. It supports a 1 million token context window by default and is planned to expand to 2 million tokens soon. It records top performance in reasoning and coding benchmarks along with native multimodality capabilities that process text, code, images, audio, and video.
Anthropic Claude Sonnet 4: Supports a 1 million token context window and provides powerful performance that can process entire codebases consisting of over 75,000 lines of code or dozens of research papers in a single request. This opens up new possibilities particularly in software development and academic research fields.
Magic LTM-2-Mini: An innovative model that processes an astonishing 100 million tokens through an approach different from existing attention-based architectures. It has become a topic of discussion by claiming to show over 1000x efficiency compared to Llama series at the same performance level, heralding the emergence of fundamentally more efficient architectures beyond simple quantitative expansion.

The emergence of these models provides developers and users with powerful tools to handle unprecedented scales of data at once. The table below summarizes the key characteristics of these major models.

Table 1: Comparison of Major LLM Context Windows (2025)

Model Name	Company	Maximum Context Window	Key Features
GPT-5	OpenAI	Undisclosed (millions+)	Dedicated reasoning module, enhanced multimodality
Gemini 2.5 Pro	Google	1,000,000 (soon 2,000,000)	Thinking model, native multimodality
Claude Sonnet 4	Anthropic	1,000,000	Optimized for large codebases and document analysis
Llama 4	Meta	10,000,000 (estimated)	Open source ecosystem-based scalability
LTM-2-Mini	Magic	100,000,000	Sequence-dimension algorithm, ultra-efficient architecture

1.3 New Developer Paradigms: Beyond Simple Q&A#

The expansion of context windows goes beyond simply summarizing longer texts, enabling completely new application types and development paradigms. Some examples include:

Comprehensive Document Analysis: Models can now receive entire research papers, technical manuals, legal contracts, etc., as input at once, understanding the full context of documents and performing in-depth analysis. This can dramatically reduce the time for experts in legal, financial, and medical fields to review vast amounts of material.
Extended Conversational History: Chatbots or AI agents can now remember entire conversations spanning hours or even days. This solves the ‘memory loss’ problem of losing context in user interactions and provides a much more personalized and consistent conversational experience.
Repository-Level Code Understanding: By including entire code repositories in the context, models can support high-level development tasks such as complex bug fixes, large-scale refactoring, and code dependency analysis. This becomes an opportunity to elevate the capabilities of AI-based development tools like GitHub Copilot to the next level.
Cache Augmented Generation (CAG): A new paradigm that pre-computes frequently used documents or information and caches them as part of the prompt. This has the advantage of shorter latency compared to RAG (Retrieval-Augmented Generation) approaches that search external databases. Thanks to massive context windows, it has become possible to directly include such large-scale caches in prompts.

1.4 Hidden Costs – Inevitable Trade-offs#

It is important to clearly recognize that these remarkable advances are not a ‘silver bullet’. Ultra-long context comes with clear costs and trade-offs proportional to its power.

Increased Financial Costs: Most commercial LLM APIs charge based on the number of input tokens. Therefore, as context length increases, API call costs increase directly. For example, Anthropic’s Claude Sonnet 4 doubles the input token cost for prompts exceeding 200,000 tokens, reflecting the increased computational costs of large-scale context usage in their pricing policy.
Increased Response Latency: As the amount of input tokens increases, the speed of output token generation tends to slow down. This can be a critical disadvantage in applications where real-time interaction is important.

In conclusion, developers in 2025 face a new ‘context-computing-cost optimization’ problem. While providing more context can improve model accuracy and reasoning ability, this means accepting higher costs and slower response speeds. Therefore, a strategic approach to finding the optimal context size for specific tasks rather than ‘unconditionally large’ has become important. This opens the era of ‘context engineering’ that considers both cost and performance, beyond simple prompt engineering.

2. Core Technology I: Reimagining Attention Mechanisms#

The heart of the transformer architecture and the biggest obstacle to realizing ultra-long context was the computational complexity problem of the self-attention mechanism. To solve this problem, innovative research has been conducted at all levels of the computing stack, from hardware, algorithms, distributed systems, to model architecture. We will see that the current technological leap was achieved not by a single technology, but by the combination of multi-faceted approaches.

2.1 The \(O(n^2)\) Bottleneck of Standard Self-Attention#

The core of the standard self-attention mechanism is calculating the relationships between all token pairs within a sequence. When the sequence length is \(n\), this means generating and computing an attention score matrix of size \(n \times n\). This causes both computational complexity and memory requirements to increase quadratically (\(O(n^2)\)) with respect to sequence length.

This quadratic complexity causes a bottleneck phenomenon where computational load and memory usage explode exponentially even with sequences of just a few thousand tokens. This was the fundamental reason why past LLMs’ context windows remained at the level of hundreds to thousands of tokens. In short, the shorter the context window, the more controllable the model operation costs were.

2.2 Engineering Optimization: FlashAttention’s I/O Bottleneck Optimization#

FlashAttention is a landmark technology that produces exactly the same values as standard attention while maximizing speed and memory efficiency through engineering optimization. FlashAttention’s key insight is recognizing that the actual bottleneck in attention computation lies in data movement (I/O) between GPU memory hierarchies rather than the computation itself.

Specifically, instead of generating the entire \(n \times n\) attention matrix in HBM (High Bandwidth Memory) and reading it back, FlashAttention uses a tiling technique that divides the input into small blocks. Computation for each block is performed within the GPU’s fast on-chip memory (SRAM), and HBM access is minimized through kernel fusion that combines multiple computation steps into one. Thanks to this approach, FlashAttention achieves up to 2x faster speed and reduced memory usage while computing accurate attention results without approximation.

The important point is that FlashAttention does not change the fundamental \(O(n^2)\) complexity of attention. Instead, it is an engineering innovation that makes quadratic complexity computation much more efficient by maximizing the characteristics of hardware. This opened the path to practically using standard attention even in sequences of tens of thousands of tokens, which was unrealistic in the past.

2.2.1 Hands-on: Enabling FlashAttention in Hugging Face Transformers#

Hugging Face’s 🤗 Transformers library tightly integrates FlashAttention, allowing it to be easily activated with just the attn_implementation parameter when loading a model. This enables significant improvements in inference speed and memory efficiency with minimal changes to existing code. Below is how to selectively use standard attention and FlashAttention when loading an example model that supports FlashAttention-3:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assuming GPU is Hopper architecture or above, and flash-attn library is installed

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/gpt-oss-20b" # Example model ID supporting FlashAttention-3

# 1. Load model with default attention implementation

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_eager = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
print("Model with standard attention loaded.")

# 2. Load model with FlashAttention-3 implementation

# Internally uses vLLM's FlashAttention-3 kernel,
# which is automatically downloaded from the hub via the 'kernels' package.

try:
    model_flash = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="kernels-community/vllm-flash-attn3" # Enable FlashAttention-3
    )
    print("Model with FlashAttention-3 loaded successfully.")
    print("Note: This requires a compatible GPU (e.g., NVIDIA Hopper series).")
except ImportError:
    print("FlashAttention is not installed or the environment does not support it.")
except Exception as e:
    print(f"An error occurred while loading with FlashAttention: {e}")

Example execution results:

Model with standard attention loaded.
Model with FlashAttention-3 loaded successfully.
Note: This requires a compatible GPU (e.g., NVIDIA Hopper series).

Checkpoint Questions#

What problems does FlashAttention solve in existing attention mechanisms, and how does the tiling technique work?
What are the main hardware acceleration features of the NVIDIA Hopper architecture utilized in FlashAttention-3?
What are the required conditions for using the attn_implementation=”kernels-community/vllm-flash-attn3” option, and what happens when these conditions are not met?

2.3 Algorithmic Optimization: Linear Time Approximation (Linear Attention)#

Attempts to fundamentally lighten attention computation itself have also been actively pursued. A representative example is Linear Attention techniques, which aim to approximate the attention mechanism to reduce complexity from \(O(n^2)\) to \(O(n)\) (linear).

The core idea is to replace the softmax function with a specific kernel function, changing the order of matrix multiplication computation. This allows achieving the same effect without directly computing the massive \(n \times n\) attention matrix. For example, by applying specific feature functions to Queries and Keys, attention score computation can be decomposed into two small matrix multiplications. This makes the total computational load linearly proportional to sequence length.

However, this efficiency improvement comes with a trade-off with accuracy. Since linear attention is an approximation technique, it does not guarantee exactly the same results as standard attention, and the resulting approximation error may partially degrade model performance. Flash Linear Attention (FLA), proposed in a similar context to FlashAttention, is an example that implements this linear attention concept as a hardware-friendly ‘chunkwise parallel algorithm’ to maximize efficiency.

2.4 Systemic Optimization: Distributed Attention with Ring Attention#

Ring Attention is a system-level innovation that distributes attention computation for long sequences across multiple devices (GPU/TPU), overcoming the limitations of single devices. Unlike common parallelization techniques used in large-scale model training such as model parallelism, data parallelism, and pipeline parallelism, Ring Attention corresponds to sequence parallelism. It divides long input sequences into multiple pieces and assigns each piece to different devices to perform attention in parallel.

Ring Attention’s operation mechanism is as follows:

Sequence Partitioning and Distribution: Very long input sequences are divided into multiple blocks, with each block assigned to different devices.
Conceptual Ring Formation: All devices participating in computation form a logical ring-shaped connection structure.
Block-wise Computation and Communication Overlap: Each device begins attention computation for its assigned Query block. When Key and Value blocks stored on other devices are needed, it receives the necessary KV blocks from the next device in the ring while simultaneously sending its own KV blocks to the previous device in the ring. This communication is designed to occur simultaneously (overlapped) with each device’s attention computation.
Communication Overhead Elimination: When each device finishes its computation, it immediately receives new KV blocks from the next device and continues computation. Since computation and communication proceed completely overlapped, almost no additional communication delay occurs within the ring.

Thanks to this structure, Ring Attention enables linear scaling of context size proportional to the number of devices. For example, using 1024 TPUs has been demonstrated to process over 10 million tokens of context with Llama 2 models. This is a victory of system architecture that opens the path to processing virtually ‘near-infinite’ context without approximating attention computation.

2.5 Architectural Innovation: Magic’s Sequence-Dimension Algorithm#

Magic’s LTM-2-Mini model presents the possibility of fundamental architectural changes beyond existing transformer-based attention. The ‘sequence-dimension algorithm’ used by this model boasts remarkable efficiency, being reportedly about 1000x cheaper in terms of computational load (FLOPs) compared to Llama 3.1’s attention mechanism for 100 million token context.

However, the most shocking innovation lies in memory usage, particularly the solution to the KV cache problem. In standard transformer models, KV cache is the space for storing Key and Value vectors of all previous tokens for attention computation. For large Llama models with 100 million token context, storing this KV cache alone requires approximately 51TB of VRAM, which translates to needing about 638 H100 GPUs per user. This commonsensically impossible requirement was a new barrier blocking the practical implementation of ultra-long context.

In contrast, LTM-2-Mini claims to operate with only a tiny fraction of a single H100 GPU’s memory for the same 100 million token context. This suggests that LTM-2-Mini processes sequence information in a completely new way, departing entirely from existing KV cache mechanisms. Overcoming the ‘memory barrier’ of KV cache is the next core challenge of the ultra-long context era, and LTM-2-Mini’s approach presents one solution to this.

Meanwhile, a new evaluation method called ‘HashHop’ has also been proposed to accurately assess the performance of such innovative models. While existing ‘Needle in a Haystack’ tests may rely on semantic hints, HashHop uses random and incompressible hash values to force models to accurately store and retrieve necessary information from the entire context without semantic clues. This allows for more rigorous measurement of models’ actual information processing capabilities.

The table below compares the technical characteristics of various attention mechanisms discussed in this section.

Table 2: Comparison of Long-Context Attention Mechanisms

Mechanism	Computational Complexity	Memory Complexity	Accuracy	Core Principle
Standard Attention	\(O(n^2)\)	\(O(n^2)\)	Exact	Direct computation of relationships between all token pairs
FlashAttention	\(O(n^2)\)	\(O(n)\) (practical)	Exact	I/O-aware optimization (tiling, kernel fusion)
Linear Attention	\(O(n)\)	\(O(n)\)	Approximate	Softmax approximation through kernel functions
Ring Attention	\(O(n^2/N)\)	\(O(n/N)\)	Exact	Sequence parallelism + communication-computation overlap (\(N\) devices)

3. Core Technology II: Extending Positional Encoding#

Since transformers cannot inherently understand the order or relative positions of tokens, separate positional encoding is needed to provide this information. Rotary Positional Embeddings (RoPE) is widely used as an effective method for encoding such positional information, but it has an ‘extrapolation’ problem where performance degrades for long sequences not seen during training. Solving this positional extrapolation problem is also a core challenge for realizing ultra-long context.

3.1 RoPE’s Limitations – The ‘Extrapolation’ Problem#

RoPE multiplies rotation matrices generated from each token’s absolute position with Query and Key vectors to reflect relative positional information between tokens in attention computation. This approach is very effective within the sequence length range that the model encountered during training.

However, problems occur when the model processes attention for positions beyond the maximum length experienced during training (e.g., 4096 tokens), such as the 10,000th token. For positions the model has ‘never seen’, it cannot properly interpret the corresponding positional embeddings, causing token distance information to become scrambled or attention scores to become unstable. This results in a sharp degradation in model performance, which is exactly RoPE’s extrapolation problem.

3.2 LongRoPE – Sophisticated Scaling Solution#

LongRoPE is a cutting-edge technology that can extend the context window of existing pre-trained LLMs to over 2 million tokens with minimal fine-tuning. LongRoPE’s success is based on three core mechanisms that deeply understand the characteristics of positional information and handle it with sophistication, going beyond naive approaches that simply mathematically increase positional values.

3.2.1 Mechanism 1 – Leveraging Non-Uniformity#

LongRoPE’s important insight is that uniform interpolation treating all positions and all RoPE dimensions equally is not optimal. Instead, it actively leverages two forms of ‘non-uniformity’:

Variable Scaling by RoPE Dimension: Each dimension of RoPE embeddings rotates at different frequencies. LongRoPE recognizes that some dimensions are more important for preserving positional information and applies different scaling factors for each dimension. Methods like evolutionary search algorithms are used to find optimal non-uniform scaling combinations.
Differential Application by Token Position: Early tokens in the sequence play a very important role in setting the overall context (this is also called the ‘attention sink’ phenomenon). LongRoPE takes a differentiated strategy of applying less interpolation or no interpolation at all to these sections to maximally preserve the positional information of early tokens.

Through such sophisticated and non-uniform approaches, LongRoPE achieved remarkable results of extending context windows up to 8x without separate fine-tuning.

3.2.2 Mechanism 2 – Progressive Extension Strategy#

To reach millions of tokens in length, direct fine-tuning from the beginning to that length brings two problems. First, computational costs increase astronomically. Second, obtaining such long, high-quality training data is very difficult.

LongRoPE uses an efficient 2-stage progressive extension strategy to solve these problems. Like a kind of curriculum learning, it guides the model to gradually adapt to long contexts:

Stage 1 – Intermediate Length Fine-tuning: First, fine-tune the model to a manageable intermediate length context (e.g., 256k tokens). This allows the model to develop a basic sense of ‘long context’.
Stage 2 – Interpolation to Final Length: Apply positional interpolation once more to the model adapted to 256k to extend to the final target length (e.g., 2048k tokens). Since some experience with long context has already been accumulated, additional interpolation is much more stable and effective.

This progressive approach shows that carefully designed training strategies along with architectural innovations are essential. In other words, the success of ultra-long context models depends on the close interaction between architecture and training process.

3.2.3 Mechanism 3 – Short Context Performance Restoration#

In the process of extremely extending the context window, side effects may occur where the model’s performance in originally short contexts (e.g., 4k, 8k) degrades. LongRoPE goes through a final adjustment stage to prevent this. By re-searching and applying scaling factors optimized for short context lengths to the extended model, it ensures that the model maintains both long context processing capability and existing short context performance.

3.3 Hands-on: Context Extension Example Using LongRoPE#

Open source implementations of the LongRoPE methodology are available, allowing easy extension of existing pre-trained LLM context windows. The example below shows the process of extending a model with base 4k context using LongRoPE to 2048k (approximately 2.1 million) token context.

# 1. Setup: Define model dimensions and target context length

data_path = "path/to/your/dataset"
d_model = 512
n_heads = 8
num_layers = 6
base_length = 4096 # Original model's maximum context length (4k)
target_length = 2048 * 1024 # Target context length (2048k, approximately 2.1M tokens)

# 2. Data loading and LongRoPE model initialization

data = load_data(data_path)
model = LongRoPEModel(d_model, n_heads, num_layers, base_length)

# 3. Context window extension through LongRoPE

model = model.extend_context(data, target_length)

# 4. Extended model testing: Process random input of target_length

input_ids = torch.randn(2, target_length, d_model)
output = model(input_ids)
print(output.shape) # Expected output shape: (batch_size, target_length, d_model)

In the above code, LongRoPEModel is initialized and the context window is extended through the extend_context() method, which fine-tunes the pre-trained model using progressive interpolation strategy. For example, a model that originally had base_length=4096 can now process up to target_length=2097152 (2,097,152) tokens after going through LongRoPE. Printing the final output shape shows that it successfully processed a sequence of approximately 2.1 million length with batch size 2.

Note: Applying LongRoPE requires considerable computational resources, time, and appropriate data curriculum. However, this technology has significant practical value in that it enables utilization of ultra-long context without increasing the model’s parameter count.

4. RAG vs Ultra-Long Context: 2025’s Debate and Integration#

With the opening of the 1 million token era, heated debates arose in the NLP community. If models can put entire vast knowledge bases into their context windows, wouldn’t RAG (Retrieval-Augmented Generation) approaches that search and add external information to prompts become unnecessary? The discourse was whether ultra-long context LLMs, if sufficiently intelligent, could solve problems by reading all necessary knowledge at once. Particularly with the rise of AI agents, expectations even emerged that “agents would replace RAG”.

4.1 The Beginning of the Debate – “Is RAG a Relic of the Past?”#

The dramatic expansion of context windows seemed to herald the end of RAG. RAG’s core role was to find and provide latest information or internal documents that models don’t know to the context through external search. However, now that models can ‘read’ entire documents spanning hundreds of pages, arguments emerged questioning whether there’s still a need to search for information externally. In some corners of the industry, views even appeared that “ultra-long context + agent combinations would replace RAG”.

4.2 The Necessity of RAG – Limitations of Naive Ultra-Long Context#

However, the claim that “RAG is dead” overlooks various problems that occur when using ultra-long context without any strategy. Latest research has revealed the following limitations:

‘Lost in the Middle’ Problem: Models tend to remember information at the beginning and end of long contexts relatively well, but miss or ignore information in the middle. This phenomenon where model performance on long inputs forms a U-shaped curve shows that simply throwing information into the context doesn’t guarantee that the model will effectively utilize all that information.
‘Hard Negatives’ Problem: In RAG pipelines, performance improves as the number of retrieved documents increases, but beyond a certain point, it actually decreases. This is because documents in search results that are superficially similar to the question but irrelevant to the actual answer (hard negatives) confuse the model and degrade answer quality. This is an example showing that unconditionally adding many documents is not the answer.
Cost and Latency: As mentioned earlier, putting millions of tokens of context into the model for every query is very expensive and causes significant response delays. In contrast, RAG is much more efficient in most cases because it selects only necessary information and provides it to the model in a much smaller context.
Boundaries of Knowledge: Even if 10 million tokens of context can be included, it can still only contain finite scope of information. RAG has the fundamental advantage of being able to access virtually infinite external knowledge bases (e.g., the entire internet, all internal company documents).

In summary, even ultra-long context models “don’t know what they don’t know”. Latest knowledge or specialized expert information not embedded in model parameters must still be brought from external sources. Additionally, the structural limitations and cost problems inherent in long contexts cannot be overlooked.

4.2.1 Hands-on: RAG-based QA Pipeline Using Haystack#

In industry, open-source frameworks like Haystack are widely used to implement RAG. Haystack enables easy construction of end-to-end QA systems consisting of document store + retriever + reader/generator models through flexible pipeline configuration. Below is a simple document-based QA pipeline example. It shows the process of putting one document into an in-memory document store and extracting answers using a BM25-based Retriever and pre-trained Reader model.

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

# 4) QA execution

query = "Who is the director of Squid Game?"
result = pipeline.run(query=query, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}})
print(result['answers'][0].answer)

In the above code, a simple pipeline was built by putting one document into an in-memory document store and combining a BM25-based Retriever with an Electra Reader trained on Korean KorQuAD data. When a query is input to pipeline.run(), the Retriever finds the top 5 documents, and the Reader extracts and returns the answer from among them. For example, the above question would yield the correct answer “Hwang Dong-hyuk”.

Haystack’s strength lies in how easily components can be replaced or extended. It’s possible to switch to a Dense Retriever or attach a generation model like GPT-3 as a Generator instead of a Reader. It can also support complex reasoning scenarios by configuring multiple nodes sequentially/parallelly in the middle, like multi-hop QA.

In actual industrial settings, there are many cases of using Haystack to construct RAG pipelines for domain document search + QA services or injecting external knowledge into chatbots. In summary, Haystack is a framework that ties together search engines and NLP models, providing a tool that enables construction of powerful document-based QA systems with relatively little code.

Exercise Questions: What changes would be needed to use a Dense Retriever (e.g., DensePassageRetriever) instead of BM25Retriever in the above Haystack example? Also, think about what advantages and disadvantages there would be in attaching a GPT-series generation model as a Generator instead of a Reader. Finally, experimentally explain what effect adjusting the top_k values passed to params during pipeline.run() calls would have on the results.

4.3 2025’s Integration – RAG as AI Agent Memory#

Ultimately, the framework of “RAG vs long context” was a false dichotomy. The latest 2025 paradigm integrates the two technologies not as opposing forces, but as mutually complementary components of AI agent cognitive architecture.

Ultra-Long Context = Short-Term Working Memory: A space where agents temporarily store and process vast amounts of information directly related to currently performed tasks.
RAG = Structured Long-Term Memory: A systematic memory that agents persistently accumulate and manage as a knowledge repository, searchable and retrievable whenever needed.

Particularly, RAG as a ‘long-term memory system’ is evolving beyond simple information retrieval to integrate more complex functions like human memory:

Indexing: Beyond vector DB-based similarity search, it features advanced indexing structures that enable multi-dimensional search by topic, time, etc.
Forgetting: It intentionally deletes old or invalid information to secure memory capacity and reduce noise during search.
Consolidation and Refinement: It restructures stored information by summarizing or organizing related knowledge into knowledge graph forms. This aids information retrieval and enables deeper semantic understanding.

4.4 Evolved RAG Architecture: The Rise of Graph-Based Reasoning#

HippoRAG is an example of the latest RAG framework implementing such long-term memory system concepts. This framework, inspired by the memory formation principles of the human hippocampus, shows new possibilities for RAG.

HippoRAG’s Architecture:
Offline Knowledge Graph Construction – It uses LLMs to analyze entire document corpora and pre-builds a Knowledge Graph (KG) representing relationships between documents. This is similar to the process where the human brain’s neocortex stores information and the hippocampus manages its index.
Online Search and Reasoning – When a user’s question comes in, it uses the question’s core concepts as seeds to perform a Personalized PageRank algorithm on the knowledge graph. Through a single graph traversal, it integrates and reasons about related information scattered across multiple documents to find the core content needed for the query.
HippoRAG’s Advantages: This approach demonstrates excellent performance even on complex questions that require multiple steps to reach answers, like multi-hop question answering. Also, compared to the iterative search (Query Rewriting) methods commonly used in RAG, it derives answers through single LLM calls, making it much faster and cheaper. The follow-up research HippoRAG 2 further enhances associative reasoning and sense-making capabilities and is being applied to complex knowledge integration scenarios.

Such developments show that we are moving away from viewing LLMs as simple input-output functions toward building sophisticated cognitive architectures around them. With ultra-long context handling ‘working memory’ and evolved RAG handling ‘long-term memory’, how to design the organic interaction between these two has emerged as a core challenge in AI agent development. This signifies the arrival of an era of designing AI’s memory systems and thinking structures, beyond simple information provision problems.

5. Practical Considerations: The Gap Between Benchmarks and Reality#

While ultra-long context technology is rapidly advancing, to apply it to actual applications, we must clearly understand the gap between advertised specifications and realistic limitations. Latest benchmarks play an important role in measuring this gap and providing practical guidelines to developers.

5.1 Diversification of the 2025 LLM Ecosystem#

While ultra-long context is clearly the most prominent trend in 2025 LLM technology, it is not the only direction. The LLM ecosystem is developing in multiple branches to meet diverse requirements and coexisting with other major trends:

Multimodality: Latest models like Google’s Gemini 2.5 Pro natively understand and process various forms of data including images, audio, and video, not just text. This means the ability to integrate and reason with visual and auditory information, beyond text understanding capabilities.
Smaller, Specialized Models: As an opposite flow to large models, small and efficient models optimized for specific domains or tasks are gaining attention. These models have the advantages of fast response speed, low operational costs, and edge deployment capability on smartphones or IoT devices.
Agentic Workflows: Autonomous AI agent technology that uses LLMs as core engines to plan complex tasks and solve problems through multiple steps is spreading. OpenAI’s GPT-5 is also enhanced in tool usage and logical planning execution compared to GPT-4, effectively supporting such workflows.

5.2 The Need for Better Evaluation – Emergence of Long-Context Benchmarks#

As models’ context processing capabilities have dramatically improved, existing benchmarks have become insufficient to adequately evaluate them. This has led to the emergence of new evaluation frameworks specialized for long context.

LongBench v2: A benchmark composed of challenging problems including vast contexts up to 2 million words, evaluating models’ ability to perform deep understanding and reasoning within long contexts. Tasks include answering comprehensive questions after being given multiple long papers, or summarizing plot after reading an entire novel.
SWE-Bench: Utilizing software issues from actual GitHub repositories, it measures models’ ability to solve problems within complex and long code contexts similar to realistic development environments. This provides practical indicators for examining models’ long code understanding and debugging capabilities.

5.3 Reality Check – Findings from the LONGCODEU Benchmark#

The LONGCODEU benchmark published in 2025 is an important study that reveals the realistic limitations of current long-context LLMs, particularly focusing on ‘long code understanding’ capabilities.

Key Finding: LONGCODEU experimental results showed that even the most advanced LLMs experience sharp performance degradation when code length exceeds 32,000 tokens. This means that even though models have advertised context windows of 128k~1M tokens, they cannot function properly in complex reasoning beyond approximately 32k.
Most Difficult Task: Particularly, inter-code unit relation understanding was found to be the most difficult for LLMs. This means models are weak at understanding how different functions, classes, and files interact within large codebases.

These findings reveal the core challenge of the ultra-long context era. There is a clear difference between ‘advertised context windows’ and ‘actually reasoning-capable windows’. While current technology has secured the ability to receive vast amounts of information as input, the ability to perform deep and consistent reasoning across all that information is still limited. In short, while models’ ability to simply find information may have improved through context expansion, their ability to think like humans within that context has not yet caught up with that speed. Bridging this gap will be an important direction for next-generation LLM research.

5.4 Conclusion – Strategic Recommendations for Industry Developers#

Integrating 2025’s cutting-edge technology trends and realistic limitations, I will conclude this lecture by presenting strategic recommendations for industry developers to effectively utilize ultra-long context technology.

Be Selective: There’s no need to unconditionally use the entire maximum context window just because it’s large. It’s important to select only information truly necessary for the task to include in the context and reduce unnecessary token waste.
Structure Intelligently: To mitigate the ‘lost in the middle’ problem, it’s advantageous to place the most important information at the beginning or end of the context. Also, using document section divisions or summaries to arouse the model’s attention is another method.
Monitor and Benchmark: During application development, continuously measure response speed, output quality, token costs, etc., to find the optimized context length for the specific task. In some cases, 16k or 32k may be sufficient, and anything beyond that may be over-specification.
Embrace Hybrid Approaches: Rather than going all-in on one technology paradigm, consider hybrid architectures that combine the advantages of each technique. For example, use ultra-long context as the agent’s ‘working memory’, Cache Augmented Generation (CAG) as ‘high-speed cache’ for frequently used data, and evolved RAG (HippoRAG, etc.) as ‘long-term memory that searches and reasons’ necessary information from vast external knowledge. Such configurations can provide balanced solutions in terms of performance, cost, and response speed.

The ultra-long context revolution has elevated LLM capabilities to new dimensions, but it also demands more sophisticated and strategic utilization methods from us. Maximizing the potential of technology while clearly recognizing its limitations and the wisdom to complement them will determine the success or failure of future AI application development.

Checkpoint Questions#

Meaning of Context Window Expansion: What fundamental changes has the possibility of context windows over 1 million tokens brought to LLM utilization methods? What are the differences between the past ‘knowledge compression’ approach and the current ‘context-based information processing’ approach?
Core of FlashAttention: What is the core principle that allows FlashAttention to improve performance while producing the same results as standard attention? How do tiling and kernel fusion techniques solve I/O bottlenecks?
LongRoPE’s Innovation: What are the three core mechanisms that LongRoPE used to solve RoPE’s extrapolation problem? Why is the progressive extension strategy important?
RAG vs Ultra-Long Context: Why is RAG still necessary despite the possibility of ultra-long context? What are the ‘lost in the middle’ problem and ‘Hard Negatives’ problem?
Realistic Limitations: What are the realistic limitations of ultra-long context models revealed by the LONGCODEU benchmark? Why does the gap between advertised context windows and actually reasoning-capable windows occur?
Strategic Utilization: What are the four strategic recommendations for effectively utilizing ultra-long context technology? What problems does each one aim to solve?

References#

Meibel (2025). Understanding the Impact of Increasing LLM Context Windows. (Accessed Sep. 30, 2025)
Google AI Developers. Long context – Gemini API Docs. (Accessed Sep. 30, 2025)
Lablab.ai. LTM-2-mini AI technology page. (Accessed Sep. 30, 2025)
Shakudo (2025). Top 9 Large Language Models as of September 2025. (Accessed Sep. 30, 2025)
Google Cloud. Generative AI on Vertex AI – Gemini 2.5 Pro. (Accessed Sep. 30, 2025)
Google DeepMind (2025). Gemini 2.5: Our most intelligent AI model – The Keyword. (Accessed Sep. 30, 2025)
Google I/O 2025 – Updates to Gemini 2.5 (Accessed Sep. 30, 2025)
Anthropic (2025). Claude Sonnet 4 now supports 1M tokens of context. (Accessed Sep. 30, 2025)
Lablab.ai (2025). How Magic.dev’s LTM-2-mini is Redefining AI’s Ability to Handle Vast Contexts. (Accessed Sep. 30, 2025)
Elinext (2025). The Future of Large Language Models – Trends. (Accessed Sep. 30, 2025)
GoPenAI (2024). A Visual Guide to FlashAttention, Linear Attention, and Efficient Transformers. (Accessed Sep. 30, 2025)
Hailey Schoelkopf (2024). Linear Attention Fundamentals. (Accessed Sep. 30, 2025)
Dao et al. (2024). The I/O Complexity of Attention, or How Optimal is FlashAttention? arXiv:2402.07443
Li et al. (2025). LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding. arXiv:2503.04359
Aussie AI (2025). Ring Attention. (Accessed Sep. 30, 2025)
OpenReview (2025). RingAttention with Blockwise Transformers for Near-Infinite Context. (Accessed Sep. 30, 2025)
Wang et al. (2023). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889
Magic.dev (2025). 100M Token Context Windows. (Accessed Sep. 30, 2025)
Hopsworks (2025). What is RoPE Scaling. (Accessed Sep. 30, 2025)
Ding et al. (2024). LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. arXiv:2402.13753
LongRoPE GitHub Repository – Implementation of LongRoPE (2024)
Reddit (2025). What are your thoughts on the ‘RAG is dead’ debate? (Accessed Sep. 30, 2025)
Wu et al. (2025). U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack. arXiv:2503.00353
OpenReview (2025). Long-Context LLMs Meet RAG: Overcoming Challenges…
Su et al. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs. arXiv:2405.14831
OSU-NLP-Group (2024). HippoRAG GitHub Repository. (Accessed Sep. 30, 2025)
PrajnaAI (2025). LLM Trends 2025: A Deep Dive into the Future. (Accessed Sep. 30, 2025)
LongBench v2 (2025). LongBench v2 Benchmark Suite. (Accessed Sep. 30, 2025)
Evidently AI (2025). 10 LLM coding benchmarks. (Accessed Sep. 30, 2025)
Li et al. (2025). LONGCODEU: Benchmarking Long-Context LMs on Long Code Understanding. ACL Anthology

Week 7: Ultra-Long Context Processing and Efficient Inference

Contents

Week 7: Ultra-Long Context Processing and Efficient Inference#

1. Paradigm Shift in Context Windows#

1.1 From Kilobytes to Megabytes – Quantitative Leap in Context#

1.2 Capabilities of 2025 Flagship Models#

1.3 New Developer Paradigms: Beyond Simple Q&A#

1.4 Hidden Costs – Inevitable Trade-offs#

2. Core Technology I: Reimagining Attention Mechanisms#

2.1 The \(O(n^2)\) Bottleneck of Standard Self-Attention#

2.2 Engineering Optimization: FlashAttention’s I/O Bottleneck Optimization#

2.2.1 Hands-on: Enabling FlashAttention in Hugging Face Transformers#

Checkpoint Questions#

2.3 Algorithmic Optimization: Linear Time Approximation (Linear Attention)#

2.4 Systemic Optimization: Distributed Attention with Ring Attention#

2.5 Architectural Innovation: Magic’s Sequence-Dimension Algorithm#

3. Core Technology II: Extending Positional Encoding#

3.1 RoPE’s Limitations – The ‘Extrapolation’ Problem#

3.2 LongRoPE – Sophisticated Scaling Solution#

3.2.1 Mechanism 1 – Leveraging Non-Uniformity#

3.2.2 Mechanism 2 – Progressive Extension Strategy#

3.2.3 Mechanism 3 – Short Context Performance Restoration#

3.3 Hands-on: Context Extension Example Using LongRoPE#

4. RAG vs Ultra-Long Context: 2025’s Debate and Integration#

4.1 The Beginning of the Debate – “Is RAG a Relic of the Past?”#

4.2 The Necessity of RAG – Limitations of Naive Ultra-Long Context#

4.2.1 Hands-on: RAG-based QA Pipeline Using Haystack#

4.3 2025’s Integration – RAG as AI Agent Memory#

4.4 Evolved RAG Architecture: The Rise of Graph-Based Reasoning#

5. Practical Considerations: The Gap Between Benchmarks and Reality#

5.1 Diversification of the 2025 LLM Ecosystem#

5.2 The Need for Better Evaluation – Emergence of Long-Context Benchmarks#

5.3 Reality Check – Findings from the LONGCODEU Benchmark#

5.4 Conclusion – Strategic Recommendations for Industry Developers#

Checkpoint Questions#

References#