# Transformer, Mamba, RWKV, Jamba Architecture Q&A

## Transformer Architecture

**Q:** What are the main advantages of self-attention in the Transformer, and where does the bottleneck occur during inference?

**A:** The biggest advantage of **self-attention**, the core mechanism of the **Transformer**, is that it **processes the whole input sequence in parallel**. Unlike an RNN, it has no recurrence, so the relationships between all token pairs are computed at once during training. This makes training fast and lets the model learn **long-range dependencies** directly. The **bottleneck** appears at **inference** time, because every newly generated token must attend to **all previous tokens**. With $L$ tokens already in the context, generating the next token costs $O(L)$ attention operations (assuming the keys and values of earlier tokens are cached), so each step gets slower as the context grows and generating $L$ tokens costs $O(L^2)$ in total. The cache of past keys and values also grows linearly with $L$, so both **inference speed** and **memory usage** degrade as the context length increases.
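The growing per-step cost can be seen in a small, hedged sketch. It is not how any particular library implements decoding: `attend` and `decode_costs` are hypothetical helpers, the token features are made up, and a real model would use learned projections and many heads.

```python
import math

def attend(query, keys, values):
    """Single-head attention of one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode_costs(num_steps, dim=4):
    """Return how many cached positions each decoding step attends to."""
    keys, values, costs = [], [], []
    for step in range(num_steps):
        # Hypothetical token features standing in for learned projections.
        token = [math.sin(step + i) for i in range(dim)]
        keys.append(token)    # the cache grows by one entry per token
        values.append(token)
        attend(token, keys, values)
        costs.append(len(keys))
    return costs

costs = decode_costs(6)
print(costs)       # [1, 2, 3, 4, 5, 6]
print(sum(costs))  # 21
```

The per-step work grows linearly with the number of cached tokens, and the total over a whole generation grows quadratically.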

**Q:** Explain the difference between the Transformer encoder-decoder structure and a decoder-only structure such as GPT.

**A:** A **Transformer encoder-decoder model** has two parts. The **encoder** reads the input sequence and turns it into **context representations**; the **decoder** generates the output token by token, conditioning on that context and on the tokens it has already produced. Each decoder layer applies **masked self-attention**, which hides future tokens, and **encoder-decoder attention (cross-attention)**, which reads the encoder's output. A **decoder-only model such as GPT** has **no encoder**: it is a single stack of decoder layers operating on one stream of tokens. It predicts the next token using only masked **self-attention** over the preceding tokens, with no separate encoder input and no cross-attention. In short, an encoder-decoder model keeps **input and output sequences separate** and connects them through cross-attention, while a decoder-only model performs **autoregressive generation** over **a single sequence**.
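The two attention patterns can be illustrated with boolean masks. This is only a sketch: `causal_mask` and `cross_mask` are hypothetical helper names, and real implementations usually apply the mask by adding a large negative value to the attention scores.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

def cross_mask(target_length, source_length):
    """Cross-attention: every decoder position may see every encoder position."""
    return [[True] * source_length for _ in range(target_length)]

# Masked self-attention, used in both decoder-only and encoder-decoder models.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))

# Cross-attention, present only when there is an encoder to read from.
for row in cross_mask(2, 3):
    print("".join("x" if allowed else "." for allowed in row))
```

The first pattern is lower-triangular, so no position sees the future; the second is fully dense because the encoder output is available in full before decoding starts.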

**Q:** How do Transformer's time complexity and space complexity scale with sequence length $L$?

**A:** In the **Transformer**, the **time complexity** of self-attention grows as **$O(L^2)$** with sequence length, because a similarity score is computed for every pair of tokens. In a standard implementation the **memory (space) complexity** is also **$O(L^2)$**, because the $L \times L$ attention-weight matrix has to be stored. For example, doubling $L$ roughly quadruples both the attention computation and the memory for attention weights, which makes the Transformer inefficient on very long sequences.
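A quick piece of arithmetic makes the quadratic growth concrete. The sketch assumes the attention-weight matrix is materialized in full with 2 bytes per score; `attention_scores_bytes` is a hypothetical helper, and the numbers are per head and per layer.

```python
def attention_scores_bytes(seq_len, bytes_per_score=2):
    """Bytes for one full seq_len x seq_len attention-weight matrix."""
    return seq_len * seq_len * bytes_per_score

for seq_len in (1024, 2048, 4096):
    mib = attention_scores_bytes(seq_len) / 2**20
    print(f"L={seq_len:5d}  scores={mib:6.1f} MiB per head")
```

Under these assumptions the matrix takes 2 MiB at $L = 1024$, 8 MiB at $L = 2048$, and 32 MiB at $L = 4096$: each doubling of $L$ multiplies the size by four.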

## Mamba Architecture

**Q:** What is Mamba's biggest advantage over Transformer? Explain how Mamba can avoid the $O(n^2)$ bottleneck of attention.

**A:** **Mamba** is a **sequence model** proposed in late 2023, and its biggest advantage is that it handles long sequences **without attention**. It is a recurrent architecture built on a **Selective State Space Model (Selective SSM)**, designed so that processing time grows **linearly with sequence length**; it never has to compute a score for every token pair the way a Transformer does. Internally, Mamba **updates a hidden state token by token, like an RNN**, but it uses a **hardware-aware parallel algorithm** so that this sequential update does not become a training bottleneck. As a result, information still flows between tokens **while the $O(n^2)$ cost of attention is avoided**, and the authors report about **5x higher inference throughput than Transformers**. In summary, Mamba's structure makes **very long sequences** practical to handle and keeps **computation and memory usage low** even for long contexts.
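Why a token-by-token recurrence can still be parallelized is easiest to see on a scalar toy. The sketch below is not Mamba's kernel. It only assumes the update has the affine form $h_t = a_t h_{t-1} + b_t$ and shows that such updates compose associatively, so prefixes can be combined in a tree instead of strictly left to right. All function names are hypothetical.

```python
def sequential_scan(a, b, h0=0.0):
    """Reference: apply h = a_t * h + b_t one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def tree_scan(pairs):
    """Inclusive prefix composition by divide and conquer.

    The two halves are independent, so a parallel machine could
    process them at the same time before merging.
    """
    if len(pairs) == 1:
        return list(pairs)
    mid = len(pairs) // 2
    left = tree_scan(pairs[:mid])
    right = tree_scan(pairs[mid:])
    carry = left[-1]  # effect of the whole left half
    return left + [combine(carry, r) for r in right]

a = [0.5, 0.5, 0.5, 0.5]
b = [1.0, 2.0, 3.0, 4.0]
# With h0 = 0, the state after step t is the offset of the t-th prefix map.
parallel = [offset for _, offset in tree_scan(list(zip(a, b)))]
print(sequential_scan(a, b))  # [1.0, 2.5, 4.25, 6.125]
print(parallel)               # [1.0, 2.5, 4.25, 6.125]
```

Both versions give the same states, but the tree version has logarithmic depth, which is what makes a scan-style formulation friendly to parallel hardware.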

**Q:** What does "selective" behavior mean in Mamba's Selective SSM? What effects did this achieve in language models?

**A:** In a **Selective SSM**, "**selective**" means that **the state space model's coefficients (e.g., the state transition matrices) are computed dynamically as functions of the input tokens**. Instead of updating the state the same way at every time step, the model **decides how much previous information to keep or forget based on the content of the current token**. This works like the **gates** of a gated RNN: **important information is retained for a long time and irrelevant information is forgotten quickly**. Thanks to this selective state control, Mamba can express **content-based dependencies between tokens** and reach high performance on **discrete token data such as natural language**, which earlier SSMs with fixed coefficients handled poorly.
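A scalar toy can show what input-dependent coefficients do. This is a simplified sketch, not Mamba's parameterization: only the step size `delta` depends on the input here, the state is a single number, and `a`, `w_delta`, `b`, and `c` are hypothetical constants chosen for illustration.

```python
import math

def softplus(z):
    """Smooth function that maps any real number to a positive one."""
    return math.log1p(math.exp(z))

def selective_scan(xs, a=-1.0, w_delta=2.0, b=1.0, c=1.0):
    """Scalar recurrence whose decay is recomputed from every input."""
    h, outputs, decays = 0.0, [], []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size, always > 0
        decay = math.exp(delta * a)    # in (0, 1) because a < 0
        h = decay * h + delta * b * x  # small delta: keep state, mostly ignore x
        outputs.append(c * h)
        decays.append(decay)
    return outputs, decays

_, decays = selective_scan([-5.0, 5.0])
print(decays)
```

For the strongly negative input the decay is close to 1, so the previous state is almost fully kept; for the strongly positive input it is close to 0, so the state is largely overwritten. A fixed-coefficient SSM would use the same decay for both.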

**Q:** Mention performance-related characteristics shown by the Mamba-3B model (e.g., comparison with same-size Transformer, comparison with twice-larger Transformer, etc.).

**A:** **Mamba-3B** is a Mamba model with about 3 billion parameters. It reportedly **outperformed Transformers of the same size** and reached **performance comparable to Transformers with twice as many parameters**. In other words, Mamba-3B showed better language modeling quality than 3B-scale Transformers and results similar to 6B-scale Transformers. This suggests that the efficiency of the Mamba architecture lets **a smaller model match or exceed Transformer performance**, that is, it offers **strong performance relative to model size**, and that Mamba's **architectural changes** translate into measurable gains in model quality.

## RWKV Architecture

**Q:** Which shortcomings of the Transformer was the RWKV architecture designed to solve, and how does it combine the respective advantages of the Transformer and the RNN?

**A:** **RWKV** was designed to address two **Transformer limitations**: the difficulty of **processing long contexts** and **high resource consumption**. Because attention cost grows with context length, Transformers are limited in practical **context length** and need large amounts of GPU memory. RWKV borrows ideas from **RNNs** so that the context length is, **in principle, unbounded**. From the Transformer it keeps **parallel training**: the whole sequence is processed at once during training, using a reformulated attention-like computation, which keeps GPUs well utilized. From the RNN it keeps **efficient sequential inference**: tokens are generated **one at a time** from a recurrent state. In summary, RWKV is a hybrid design that is **fast like a Transformer during training** and **lightweight like an RNN during inference**.

**Q:** How does RWKV's inference method differ from Transformer, and what benefits does this provide? (Hint: KV cache vs hidden state)

**A:** During inference, a **Transformer** stores a **KV cache** for all previous tokens and computes **attention over the entire cache** at every generation step. **RWKV** instead has each layer maintain **its own hidden state** and simply **updates that state** when a new token arrives. There is no need to keep information about every previous token in a large KV cache; a **fixed-size hidden state** is enough. The main benefits are **memory efficiency and speed**. RWKV's memory usage barely grows as the context gets longer, and the **computation per token is constant** rather than growing with the number of tokens as in attention, so it **keeps a consistent speed even on very long inputs**. This makes RWKV **well suited to long documents** compared with Transformers and makes it easier to **run large models on modest hardware**.
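The memory difference can be estimated with simple arithmetic. The sketch below uses hypothetical model dimensions (32 layers, hidden size 4096, 2 bytes per value), assumes plain multi-head attention with no cache compression, and takes `vectors_per_layer=5` as an illustrative state size rather than a statement about any specific RWKV release.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_value=2):
    """Keys plus values for every token in every layer."""
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value

def recurrent_state_bytes(num_layers, hidden_size, vectors_per_layer=5,
                          bytes_per_value=2):
    """A fixed number of hidden_size vectors per layer, independent of tokens."""
    return num_layers * vectors_per_layer * hidden_size * bytes_per_value

for num_tokens in (1_000, 10_000, 100_000):
    cache = kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096)
    state = recurrent_state_bytes(num_layers=32, hidden_size=4096)
    print(f"{num_tokens:>7} tokens  "
          f"kv_cache={cache / 2**20:9.1f} MiB  state={state / 2**20:5.2f} MiB")
```

The cache column grows in proportion to the number of tokens while the state column stays the same on every row, which is the trade-off the answer describes.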

**Q:** What does RWKV's name mean, and briefly summarize the roles of Time-mix and Channel-mix.

**A:** **RWKV** stands for **Receptance Weighted Key Value**; the name comes from the four main elements of the network. **Receptance (R)** acts as a **gate that controls how much past information is accepted**, **Weight (W)** is an **exponential time-decay weight applied to past information** (a coefficient that gradually reduces the influence of earlier tokens), and **Key (K)** and **Value (V)** are the key and value vectors that carry the **information contributed by the current token**.

Each layer of the RWKV architecture has two stages: **Time-mix** and **Channel-mix**. **Time-mix** **combines the current token's input with the Key/Value information accumulated from previous tokens**, using the W decay to **fade the previous state** and the R gate to control how much of the result is passed on as **new information is integrated**. It plays the role that **attention** plays in a Transformer, namely **integrating information across time**.

**Channel-mix** then performs a **transformation along the channel (feature) dimension** of each token, applying a **token-wise nonlinear transformation** much like a standard **Feed-Forward Network (FFN)**. Part of the previous token's representation is also mixed into its input, and the output is **modulated by a gate**; overall it serves a role similar to the Transformer's FFN. In summary, Time-mix is responsible for **mixing information across the sequence** (temporal processing) and Channel-mix for **mixing across feature dimensions** (channel processing), so RWKV **handles both token dependencies and per-token transformations without attention**.
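The Time-mix idea of a decayed, key-weighted average of past values can be sketched on a single channel. This is a simplified, hedged illustration: it assumes a scalar decay `w` and a scalar bonus `u` for the current token, omits the numerical-stability tricks real implementations need, and uses hypothetical function names. It also checks that the constant-size recurrent form matches the explicit sum over all earlier tokens.

```python
import math

def wkv_recurrent(keys, values, w=0.5, u=0.0):
    """Decayed weighted average of values using O(1) state per step."""
    num, den, outputs = 0.0, 0.0, []
    decay = math.exp(-w)
    for k, v in zip(keys, values):
        current = math.exp(u + k)  # weight of the current token
        outputs.append((num + current * v) / (den + current))
        # Fade the accumulated past, then fold in the current token.
        num = decay * num + math.exp(k) * v
        den = decay * den + math.exp(k)
    return outputs

def wkv_direct(keys, values, w=0.5, u=0.0):
    """Same quantity written as an explicit sum over all earlier tokens."""
    outputs = []
    for t in range(len(keys)):
        num = math.exp(u + keys[t]) * values[t]
        den = math.exp(u + keys[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + keys[i])
            num += weight * values[i]
            den += weight
        outputs.append(num / den)
    return outputs

def apply_receptance(receptances, mixed):
    """Gate each mixed value with sigmoid(r), as the R gate does."""
    return [m / (1.0 + math.exp(-r)) for r, m in zip(receptances, mixed)]

keys = [0.1, -0.3, 0.8, 0.0]
values = [1.0, 2.0, -1.0, 0.5]
fast = wkv_recurrent(keys, values)
slow = wkv_direct(keys, values)
print(max(abs(f - s) for f, s in zip(fast, slow)) < 1e-9)  # True
print(apply_receptance([0.0, 0.0, 0.0, 0.0], fast))
```

The recurrent version keeps only two numbers of state per channel regardless of how many tokens have been seen, which is why inference cost per token stays constant.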

## Jamba Architecture

**Q:** In what ratio are Transformer layers and Mamba layers arranged in Jamba architecture? Explain what advantages this design provides in terms of memory and speed.

**A:** **Jamba** is a **hybrid architecture** that **interleaves Transformer (attention) layers with Mamba layers**. It stacks **several Mamba layers** for every attention layer, and the representative configuration uses an attention-to-Mamba ratio of **1:7**. For example, in a 32-layer Jamba model only **4 layers use attention** and the remaining **28 layers are Mamba**.

Because **attention is inserted only sparsely**, **global pattern processing** is handled by the occasional attention layers while **the remaining token interactions are handled by efficient Mamba layers**. This design **improves both memory usage and speed**. With few attention layers, **fewer layers need a KV cache, so the overall memory footprint shrinks**; and **on long contexts** only a few attention computations are needed per token, which gives **much higher throughput than a Transformer**. Jamba reportedly **uses about half the memory of a general Transformer of the same scale** while **generating text about 3x faster on 128K-token inputs**.
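The 1:7 layout and its effect on the KV cache can be checked with a few lines. The sketch assumes one attention layer per block of eight; `hybrid_layout`, `kv_cache_fraction`, and the `attention_offset` parameter (where in each block the attention layer sits) are hypothetical and not taken from Jamba's code. It counts only which layers need a KV cache and ignores weights, activations, and the Mamba state.

```python
def hybrid_layout(num_layers=32, period=8, attention_offset=0):
    """Label each layer as attention or Mamba, one attention layer per period."""
    return ["attention" if i % period == attention_offset else "mamba"
            for i in range(num_layers)]

def kv_cache_fraction(layout):
    """Share of layers needing a KV cache, relative to an all-attention stack."""
    return layout.count("attention") / len(layout)

layout = hybrid_layout()
print(layout.count("attention"), layout.count("mamba"))  # 4 28
print(kv_cache_fraction(layout))                         # 0.125
```

Under these assumptions the KV cache is one eighth of what an all-attention model of the same depth and width would need.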

**Q:** Why did Jamba introduce MoE? Explain using the concepts of active parameters and total parameters.

**A:** **Jamba** introduced **MoE (Mixture-of-Experts)** to increase model capacity while keeping inference efficient. Specifically, **some MLP layers are replaced with MoE layers** that contain **multiple expert networks**, and **only the top few experts are activated for each token**. For example, a Jamba MoE layer has 16 expert MLPs, and **only the 2 most relevant experts are activated per token (top-2 gating)**.
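Top-2 gating can be sketched in plain Python. This is a generic illustration of the routing idea, not Jamba's router: the experts are toy scalar functions, the router logits are made up, and renormalizing the weights over the selected experts is an assumption.

```python
import math

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(weight * experts[i](x)
               for i, weight in top_k_gate(router_logits, k))

# 16 hypothetical experts; expert s simply multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(16)]
router_logits = [0.0] * 16
router_logits[3] = 2.0
router_logits[7] = 2.0

print(top_k_gate(router_logits))                # [(3, 0.5), (7, 0.5)]
print(moe_forward(1.0, experts, router_logits)) # 5.0
```

Only 2 of the 16 expert functions are ever called for this token, which is where the compute saving comes from.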

Here, **total parameters** is the number of parameters in the entire model, including every expert, and **active parameters** is **the number of parameters actually used in the computation for a single token**. With MoE, Jamba's **total parameter count increases greatly (e.g., 5.2B → 52B)**, but because **only a small subset (e.g., the top 2 experts)** is used for each token, the **active parameter count stays limited**. For example, the Jamba 7B model has **about 52B total parameters** through MoE, but **only about 12B are active** for any given token.

This lets the **total model capacity** grow substantially to **improve performance**, while the **inference computation per token stays at the active-parameter level**. All experts still have to be held in memory, so the saving is mainly in compute rather than in weight storage. In short, with MoE Jamba aims for the effect of **"having the intelligence of a large model while paying the compute cost of a small one"**.
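The relation between total and active parameters is simple arithmetic. The numbers below are in arbitrary units and are deliberately not Jamba's real breakdown; `moe_param_counts` is a hypothetical helper that assumes every MoE layer has the same number of equally sized experts.

```python
def moe_param_counts(shared, per_expert, moe_layers, num_experts, top_k):
    """Return (total, active) parameter counts for a simple MoE model."""
    total = shared + moe_layers * num_experts * per_expert
    active = shared + moe_layers * top_k * per_expert
    return total, active

total, active = moe_param_counts(
    shared=10, per_expert=5, moe_layers=4, num_experts=16, top_k=2)
print(total, active)  # 330 50
```

With 16 experts and top-2 routing, the expert parameters count in full toward the total but only one eighth of them count toward what each token actually uses.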

**Q:** What is the maximum context length that Jamba supports, and what is the secret to maintaining performance while processing such long contexts?

**A:** **Jamba** supports a very long **context window** of **256K (about 256,000) tokens**. This is among the **longest context windows** reported for models of this kind, and it makes it possible to feed in very long documents at once for Q&A or summarization.

The key to maintaining performance at such lengths lies in the design elements described above. First, because the **number of attention layers is minimized** and most layers are Mamba, the **cost of attention on long inputs is small**. Second, Mamba layers run in **linear time**, so the computational cost grows only moderately as the context gets longer. In reported experiments, Jamba **processed 128K-token inputs on a single 80GB GPU**; a general Transformer of the same scale could not do so because of memory limits, whereas Jamba **ran without difficulty while keeping output quality on par with recent LLMs**. In summary, Jamba's architecture is **specialized for processing long contexts efficiently**, which lets it **combine fast inference with strong performance even on long inputs**.
