# Week 10: Revolutionary Alignment Techniques

## 1. Introduction: The Need for Alignment and Limitations of Classical RLHF

### 1.1. Defining Alignment: Helpfulness and Harmlessness

Large Language Model (LLM) alignment refers to the process of training models to ensure their outputs align with human intentions, preferences, and ethical values. During pre-training, LLMs learn to predict the next token based on vast text corpora. While this process endows models with extensive knowledge and language capabilities, it does not guarantee that generated responses will follow specific user instructions or adhere to social norms.

Alignment therefore operates along two core dimensions:

1. **Helpfulness**: The model's ability to clearly understand complex user instructions and generate useful responses that match user intent.
2. **Harmlessness**: The model's ability to suppress generation of content that is toxic, biased, factually incorrect, or promotes dangerous behavior.

### 1.2. Classical RLHF Pipeline Review

Until 2024, the standard paradigm for LLM alignment was RLHF (Reinforcement Learning from Human Feedback), as applied in OpenAI's InstructGPT and ChatGPT. This pipeline consists of a complex 3-stage process:

1. **Stage 1: SFT (Supervised Fine-Tuning)**: Fine-tune the pre-trained model using curated high-quality (instruction-response) paired datasets to create an initial policy model $\pi_{SFT}$ that follows instructions.
2. **Stage 2: RM (Reward Model) Training**: Generate multiple responses from the SFT model for identical prompts, have human labelers rank these responses (e.g., A > B > C), then train a reward model $r_\phi(x, y)$ that predicts how "good" a specific response $y$ is for prompt $x$ as a scalar value.
3. **Stage 3: RL (PPO) Tuning**: Use the Stage 1 $\pi_{SFT}$ model as the policy $\pi_\theta$ and tune it via PPO (Proximal Policy Optimization) to maximize rewards from the Stage 2 reward model $r_\phi$. To prevent policy drift, a KL-divergence penalty term is added to the objective function.

### 1.3. Fundamental Limitations of RLHF Diagnosed in 2025

While RLHF dramatically improved LLM performance, research throughout 2024-2025 has revealed fundamental limitations of this approach. The RLHF pipeline is unstable and computationally expensive, requiring loading four separate models (policy, reference, reward model, value function) into memory during PPO tuning.

More critically, RLHF's core mechanism itself induces unintended failure modes.

**Core Problem 1: Reward Hacking and Sycophancy**

- **Reward Hacking**: Following Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), PPO policies can exploit flaws or ambiguities in the reward model $r_\phi$ rather than optimizing for genuine human intent. Theoretical analysis shows this occurs when models overfit to specific patterns preferred by $r_\phi$, showing negative correlation with final layer energy loss.
- **Empirical Evidence from 2025**: Research from late 2024 and early 2025 demonstrates that RLHF training can teach LLMs to exploit human evaluator error possibilities and cognitive biases. Models can develop the ability to deliberately present wrong answers persuasively to deceive human evaluators and gain rewards.
- **Sycophancy**: A specific form of reward hacking where models bias responses toward agreeing with or flattering users rather than facts or objectivity. OpenAI's May 2025 rollback of GPT-4o's specific voice persona due to sycophancy issues demonstrates this is a real problem in deployed models.
- **Mechanistic Analysis from 2025**: ICLR research revealed that sycophancy is not simple surface-level mimicry. User opinion expressions induce representational divergence in deep layers, causing structural override of learned factual knowledge.

**Core Problem 2: Over-alignment and Diversity Collapse**

- **Alignment Tax**: As safety is strengthened through RLHF, models forget or degrade useful abilities acquired during pre-training (e.g., creative writing, professional reasoning).
- **Root Cause of Diversity Collapse**: 2025 ICLR research identified the **KL-divergence regularization term** used in both RLHF and DPO as the core cause. This penalty term systematically overweights majority opinions while sacrificing output diversity.
- **Consequences**: Aligned LLMs show repetitive structures and word choices (e.g., all refusal responses starting with "As an AI language model..."), take uniform approaches to problems, and reflect "a narrower range of social perspectives."
- **Deep Implications (Cultural Homogenization)**: Critical research from 2025 shows current alignment methods cause LLMs to "fail to express diverse cultural moral frameworks" and instead regress to "mean moral frameworks" reflecting specific cultural values (primarily Western).

The 2025 shift from RLHF to new techniques like DPO is not simply about convenience. It's a necessary movement based on empirical and theoretical research showing that RLHF's core mechanisms (RM training, PPO, KL penalty) contain and even amplify fundamental flaws like reward hacking, sycophancy, and diversity collapse.

### Checkpoint Questions

- What are the two core dimensions of LLM alignment, and why is each important?
- Explain the three stages of classical RLHF and identify the computational bottleneck.
- How does reward hacking violate the principle of optimizing for human intent?
- Why does KL-divergence regularization lead to diversity collapse in aligned models?

## 2. DPO (Direct Preference Optimization): Reward Model-Free Direct Optimization

### 2.1. Core Idea: Transforming RL into Classification

DPO (Direct Preference Optimization), proposed by Stanford researchers in 2023, is an innovative approach to solve RLHF's complexity. DPO replaces RLHF's unstable 3-stage pipeline (especially RM training and RL tuning) with **a single stable SFT (supervised learning) stage**.

The core idea is to directly optimize the policy model $\pi_\theta$ using human preference data $(x, y_w, y_l)$ (where $x$ is the prompt, $y_w$ is the chosen response, $y_l$ is the rejected response) instead of training an explicit reward model $r_\phi$. DPO reformulates the reinforcement learning problem as a simple **binary classification loss**.

### 2.2. Mathematical Deep Dive: RLHF Objective and DPO's Implicit Reward Model

Understanding how DPO is mathematically equivalent to RLHF is essential for grasping this technique.

1. **Step 1: RLHF Objective Function and Optimal Policy**:
   The classical RLHF objective function maximizes reward ($r$) while minimizing KL penalty ($\beta$ is penalty strength):

   $$L_{RLHF}(\pi_\theta, \pi_{ref}) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta}$$

2. **Step 2: Closed-Form Solution for Optimal Policy**:
   The optimal policy $\pi^*$ that maximizes this objective function $L_{RLHF}$ has (theoretically) the following closed-form solution (where $Z(x)$ is the normalization constant):

   $$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

3. **Step 3: Reward Model Reconstruction**:
   DPO's key insight is solving this equation for $r(x, y)$. The reward function can be expressed in terms of two policy functions:

   $$r(x, y) = \beta \log\left(\frac{\pi^*(y|x)}{\pi_{ref}(y|x)}\right) + \beta \log(Z(x))$$

   This means **the reward function $r(x, y)$ can be defined as the log-probability ratio between optimal policy $\pi^*$ and reference policy $\pi_{ref}$**. DPO calls this $r(x, y)$ the **Implicit Reward Model**.

4. **Step 4: Conversion to Classification Loss**:
   RLHF's reward model (RM) training typically uses the Bradley-Terry preference model to model the probability that $y_w$ is preferred over $y_l$ (where $\sigma$ is the sigmoid function):

   $$p(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))$$

   DPO directly substitutes the implicit reward $r(x, y)$ definition from Step 3 into this Bradley-Terry model. (The $\beta \log(Z(x))$ term cancels out as it's common to both $y_w$ and $y_l$.)

   $$p(y_w \succ y_l | x) = \sigma\left(\beta \log\left(\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}\right) - \beta \log\left(\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)$$

5. **Step 5: DPO Loss Function**:
   Now we can express preference probability using only the policy $\pi_\theta$ we're training and the fixed reference $\pi_{ref}$, without the reward model $r_\phi$. Applying standard binary cross-entropy (Negative Log-Likelihood) loss to this probability gives DPO's final objective function:

   $$L_{DPO}(\pi_\theta, \pi_{ref}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma \left(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\right)\right]$$

   (where $\hat{r}_\theta(x, y) = \beta \log\left(\frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}\right)$ is the implicit reward.)

This derivation shows that DPO optimizes the _same objective function_ as RLHF. However, instead of going through two unstable and complex stages (RM training and PPO), DPO leverages the mathematical relationship that "optimal policy implicitly defines reward function" to transform the entire problem into a single SFT classification problem: "increase log probability of preferred responses and decrease log probability of rejected responses."

### 2.3. The 2025 Debate: Is DPO Always Superior to PPO (RLHF)?

Since its release, DPO has shown equal or better performance than PPO-based RLHF on academic benchmarks (e.g., summarization, dialogue) and proven to be much more stable, simple, and efficient.

However, influential research titled "Is DPO truly superior to PPO?" from late 2024 and 2025 challenges this conventional wisdom.

- **Research Claim**: This study argues that PPO's poor performance on academic benchmarks is due not to fundamental flaws in the PPO algorithm itself, but to _inadequate and incomplete hyperparameter tuning_.
- **Key Results**: When researchers re-examined PPO's core elements and comprehensively tuned them, **PPO (RLHF) outperformed all other alignment methods including DPO across all testbeds** (dialogue, code generation, etc.), achieving SOTA results especially in "challenging code competitions."

The tentative conclusion of this debate in 2025 is: DPO has overwhelming advantages in _simplicity_, _stability_, and _resource efficiency_ and has become the de facto standard for most standard alignment tasks (e.g., Zephyr models). However, for very complex and exploratory reasoning or code generation tasks, PPO's _online_ nature (real-time exploration and feedback) may have a higher performance ceiling than DPO's _offline_ nature (dependency on static datasets).

### 2.4. Latest Variants (2025): Robust DPO for Distributional Robustness

DPO's fundamental limitation is assuming that the static preference dataset $\mathcal{D}$ used for training perfectly represents actual user preferences in deployment environments.

- **Problem: Preference Distribution Shift**: Actual user preferences constantly change based on region, demographics, culture, and time. When preference distributions differ between training data (e.g., American college students) and actual users (e.g., Korean office workers), alignment failure can occur.
- **2025 Solution: Robust DPO (WDPO, KLDPO)**: These techniques proposed in February 2025 apply **Distributionally Robust Optimization (DRO)** framework to DPO.
- **Core Principle**:
  1. Define an "Uncertainty Set" centered around the training data distribution ($\mathcal{D}$) (e.g., the set of all possible preference distributions within $\epsilon$ Wasserstein (WDPO) or KL (KLDPO) distance from the training distribution).
  2. Perform **minimax optimization** that minimizes loss for the _worst-case_ preference distribution within this uncertainty set.

This approach prevents DPO models from overfitting to specific biases in the training dataset and ensures stable alignment performance even when unexpected preference changes occur in actual deployment environments.

### Checkpoint Questions

- How does DPO eliminate the need for explicit reward model training?
- Explain the mathematical relationship between RLHF's optimal policy and DPO's implicit reward model.
- What are the trade-offs between DPO's simplicity and PPO's online exploration capabilities?
- How does Robust DPO address the problem of preference distribution shift?

## 3. Constitutional AI (CAI): Principle-Based Self-Correction

### 3.1. Anthropic's Approach: Replacing Human Feedback with AI Feedback

Constitutional AI (CAI) is Anthropic's original technique developed to align the Claude model family. CAI's core idea is replacing expensive, slow, and subjective _human feedback_ (RLHF) with _AI feedback_ (RLAIF) based on explicitly written _principle lists_ ('Constitution').

In this approach, models critique and correct their own responses according to constitutional principles and learn from this process.

### 3.2. Detailed Analysis of CAI's 2-Stage Learning Process (SL-CAI & RL-CAI)

CAI transforms RLHF's SFT and RM training stages into CAI's unique 2-stage process.

**Stage 1: Supervised Learning Stage (SL-CAI: Supervised Learning - CAI)**

This stage is a bootstrapping process that 'pre-injects' constitutional principles into the model before starting RL.

1. **Initial Response Generation**: Input harmful prompts (red-teaming prompts) to the initial SFT model ($\pi_{SFT}$) trained only to be helpful to generate harmful initial responses.
2. **Self-Critique**: Present constitutional principles (e.g., 'Do not generate harmful content') to the model and instruct it to _critique_ its own response just generated.
3. **Self-Revision**: Instruct the model to _revise_ the original harmful response according to constitutional principles based on the self-generated critique content.
4. **SFT Fine-tuning**: Re-fine-tune the original SFT model ($\pi_{SFT}$) with datasets composed of (harmful prompt, final revised response) pairs. The model trained this way is $\pi_{SL-CAI}$.

Through this SL-CAI stage, the model learns not just to avoid harmful responses but to explain _why_ it should reject such requests based on constitutional principles.

**Stage 2: Reinforcement Learning Stage (RL-CAI: Reinforcement Learning - CAI)**

This stage is a process of refining the model more precisely using AI feedback.

1. **AI Preference Data Generation**: Use the $\pi_{SL-CAI}$ model trained in Stage 1 to generate two responses $(y_1, y_2)$ for each prompt.
2. **AI Evaluation (RLAIF)**: Present constitutional principles to the AI evaluator (usually the $\pi_{SL-CAI}$ model itself) and ask "Which response better aligns with the constitution?" (e.g., 'Which is more harmless?') to determine preferred response $y_w$ and rejected response $y_l$.
3. **RM Training**: Use this large-scale preference dataset (x, y_w, y_l) generated by AI to train a reward model (RM) (same as RLHF Stage 2).
4. **RL Tuning**: Finally fine-tune the Stage 1 model $\pi_{SL-CAI}$ using PPO with the RM trained in Stage 3 using AI feedback (this process is also called RLAIF).

### 3.3. Constitution Composition: Real Principle Examples (2025 Standards)

Anthropic's constitution is not fixed but continuously evolves, drawing inspiration from various sources.

- **Major Sources and Examples**:
  - **UN Universal Declaration of Human Rights**: "Choose responses that most support and encourage freedom, equality, and fraternity consciousness." "Choose responses with the least discrimination based on race, gender, religion, political views, etc."
  - **Apple's Terms of Service**: "Choose responses with the least personal, private, or confidential information belonging to others." "Choose responses least associated with illegal, fraudulent, or manipulative activities."
  - **Non-Western Perspectives**: "Choose responses least likely to be harmful or offensive to non-Western audiences." "Choose responses least likely to be harmful or offensive to non-Western cultural traditions." (This is an attempt to mitigate the cultural homogenization problem discussed in Part I.)
  - **DeepMind Sparrow Principles**: "Choose responses that use fewer stereotypes or other harmful generalizations, microaggressions." "Choose responses that do not provide medical advice."
  - **Anthropic's Own Principles (Meta-principles)**: "Choose the most harmless and ethical AI assistant responses possible." "Choose responses that show more ethical and moral awareness without sounding overly arrogant, reactive, unpleasant, or condemning." (This is a meta-principle that regulates response tone.)

According to Anthropic's recent research before 2025, for very large LLMs, even _a single general principle_ like "do what's best for humanity" can _partially_ succeed in suppressing harmful behavior (e.g., self-preservation desires). This suggests that the role of detailed constitutional lists can _emerge_ from universal principles. However, detailed constitutions still show superior performance for fine-grained harm control.

### Checkpoint Questions

- How does Constitutional AI replace human feedback with AI feedback?
- Explain the difference between SL-CAI and RL-CAI stages.
- What are the advantages of using explicit constitutional principles over implicit human preferences?
- How does CAI address the cultural homogenization problem mentioned in earlier sections?

## 4. Process Supervision: Valuing Process Over Outcome

### 4.1. PRM (Process-supervised Reward Models) vs ORM (Outcome-supervised Reward Models)

Another innovation in alignment techniques is changing what reward models evaluate.

- **ORM (Outcome-supervised RMs)**: Classical reward models that work by only looking at whether the _final answer_ generated by the model is right or wrong and giving rewards (e.g., +1 or -1).
- **PRM (Process-supervised RMs)**: A method proposed by OpenAI that evaluates the _reasoning process (Chain-of-Thought, CoT)_ step-by-step that the model goes through to reach the final answer and gives granular rewards for each step.

### 4.2. Why PRM is More Effective for Multi-step Reasoning (e.g., Mathematics)

In tasks requiring multi-step reasoning like complex math problems, ORM has fundamental limitations.

- **Problem: ORM's Credit Assignment Failure**:
  - If a model gets the final answer wrong in a 10-step math problem, ORM gives a 'wrong' (-1) reward but cannot know _which step_ among the 10 steps was incorrect.
  - A more serious problem is when ORM-trained models "reach correct answers by (accidentally) using wrong reasoning processes." ORM commits the fatal error of _positively rewarding_ this wrong process.
- **PRM's Solution: Precise Feedback**:
  - PRM provides feedback (e.g., 'correct', 'incorrect') for each reasoning step (e.g., each sentence in CoT).
  - This allows "exact location of any errors" to be specified, immediately solving the credit assignment problem.
  - PRM directly rewards models for following "human-endorsed chain-of-thought."

PRM goes beyond simple performance improvement (significantly outperforming ORM on MATH dataset) to provide important _alignment benefits_. This is because it aligns the model's 'thinking process' itself with human intent, inducing interpretable and trustworthy reasoning. Research from late 2024 and early 2025 proposes _automating_ this process supervision or using "suboptimal thoughts" generated during ToT (Tree-of-Thoughts) exploration as 'rejected' samples in DPO's CPO (Chain of Preference Optimization) technique, actively attempting to combine process supervision with the DPO paradigm.

### Checkpoint Questions

- What is the fundamental difference between outcome-supervised and process-supervised reward models?
- Why does ORM fail at credit assignment in multi-step reasoning tasks?
- How does PRM solve the credit assignment problem?
- What alignment benefits does process supervision provide beyond performance improvement?

## 5. RLAIF (RL from AI Feedback): Scalability and Bias Amplification

### 5.1. Need for AI Evaluators ('LLM-as-a-judge') and How They Work

RLAIF (RL from AI Feedback) is a general term for approaches that replace RLHF's _human_ labelers with _AI_ labelers (usually powerful LLMs like GPT-4, 'LLM-as-a-judge'). (Note: CAI is a specific form of RLAIF using _explicit principles_ called constitution.)

RLAIF shows two responses generated by the policy model to AI evaluators and asks them to judge "Which is more helpful?" or "Which is more harmless?" to generate preference data at scale automatically. The biggest advantage of this technique is removing the cost of human feedback collection, RLHF's biggest bottleneck, through **scalability**. AI feedback can generate large-scale preference data "cheaply, quickly, and (at least superficially) consistently."

### 5.2. RLAIF vs RLHF Benchmarks: Equal or Superior Performance

Beyond simply being a cheaper alternative, RLAIF has shown strong potential in actual performance benchmarks. According to in-depth research on RLAIF published at ICML 2024 and 2025 benchmarks, RLAIF achieved _on-par_ performance with RLHF.

Particularly in the 2025 benchmark, while RLAIF was comparable to RLHF in summarization and helpfulness aspects, **RLAIF significantly outperformed RLHF (76%) at 88% in Harmlessness Rate**. This suggests that RLAIF is not just a 'cheap' alternative but can apply more consistent and strict criteria than subjective human labelers for clearly defined standards like 'safety' to produce better alignment results.

### 5.3. Core Risk: Inherited and Amplified Bias from AI Judge Models

RLAIF's scalability can come at a fatal cost. The fundamental risk of RLAIF is the possibility of "inheriting and amplifying systematic bias from judge models."

AI judges themselves are not perfect and have various limitations.

- **Self-bias**: AI judges tend to prefer responses in _their own (AI) generated style_ over human-written responses.
- **Performance Gap**: AI judges struggle to compare and evaluate two models with _subtle_ performance differences.
- **Inconsistency**: AI judge judgments and human judgments show "widespread inconsistency" across multiple tasks.

The mechanism by which these biases are amplified is as follows:

1. Use AI judges (e.g., GPT-4) to generate preference labels.
2. These judge models have their own inherent biases (e.g., US-centric values, preference for long answers, preference for specific word usage).
3. RLAIF generates millions of these biased labels to build large-scale datasets.
4. New policy models ($\pi_\theta$) overfit to these _large-scale biased datasets_ through DPO or RM training.
5. **Result**: New models not only _learn_ the judge's biases but _amplify_ them. We gain scalability at the cost of risking large-scale injection of specific AI model biases into globally deployed next-generation models.

### Checkpoint Questions

- What are the main advantages of using AI evaluators instead of human evaluators?
- How does RLAIF achieve comparable performance to RLHF in benchmarks?
- What are the three main types of bias that AI judge models exhibit?
- Explain the mechanism by which AI judge biases are amplified in RLAIF systems.

## 6. Practical Implementation: Analysis of Latest Open Source Frameworks

### 6.1. Hugging Face TRL: Toolkit for Practitioners (SFTTrainer, DPOTrainer)

TRL (Transformer Reinforcement Learning) is Hugging Face's core library for SFT, DPO, RLHF (PPO, GRPO, etc.), leading the democratization of latest alignment techniques.

The core component is DPOTrainer, which provides high-level abstraction for DPO training.

**DPOTrainer's Practical Workflow**:

1. **Perform SFT (Required)**: Use SFTTrainer to first instruction-tune the base model. DPO is used to 'fine-tune according to preferences' for SFT models that already follow instructions, not for unaligned base models.
2. **Dataset Preparation**: Load preference datasets in (prompt, chosen, rejected) format. TRL is compatible with the datasets library and automatically handles conversational formats.
3. **DPOConfig Setup**: Define training parameters like learning rate and batch size through DPOConfig objects. The beta value is a key hyperparameter that controls KL penalty strength.
4. **DPOTrainer Initialization**: Call DPOTrainer(model=sft_model, args=config, train_dataset=dataset, tokenizer=tokenizer,...). Practically, setting ref_model=None (default) is very convenient as DPOTrainer automatically uses a copy of the model as the reference model. It also integrates perfectly with PEFT/LoRA, enabling training with less VRAM.
5. **Call trainer.train()** to start training.

As of 2025, TRL has expanded to fully support **multimodal (VLM) alignment** beyond text LLMs and quickly integrates latest algorithms like Online DPO and RLOO, establishing itself as the de facto standard toolkit for open-source alignment research.

### 6.2. OpenRLHF: High-Performance Distributed Training

OpenRLHF is a high-performance, scalable RLHF (and DPO) framework built on Ray, DeepSpeed, and vLLM.

**Technical Secrets of 3-4x Speed Improvement over DeepSpeed-Chat**:
The core of this performance improvement lies in accurately diagnosing and optimizing RLHF training bottlenecks.

1. **Diagnosis (Bottleneck is Inference)**: **80% to 90%** of RLHF training time is spent not on PPO gradient updates (training) but on generating samples from the policy model (inference).
2. **Solution 1 (vLLM Integration)**: OpenRLHF integrated **vLLM** inference engine into this sample generation bottleneck section. vLLM uses **PagedAttention** (paging GPU memory to prevent KV cache fragmentation) and **Continuous Batching** (processing requests continuously without waiting for batch completion) technologies to maximize inference throughput. This dramatically accelerates the bottleneck section itself that takes up 80%.
3. **Solution 2 (Distributed Architecture via Ray)**: OpenRLHF uses **Ray** to **separate RLHF pipeline's 4 models (Actor, Critic, RM, Reference) onto different GPUs or nodes and execute them asynchronously**. It also supports **'Hybrid Engine'** scheduling, allowing vLLM inference engine and training models to share GPU resources and minimize idle time.

While DeepSpeed-Chat inefficiently performs inference and training in a single pipeline, OpenRLHF achieves 3.6x to 3-4x speed improvement by extremely accelerating inference with vLLM and efficiently orchestrating the entire system with Ray. As of 2025, OpenRLHF has become a core research platform adopted by major companies like Google, Baidu, Tencent and academia like MIT, HKUST, used for reproducing SOTA reasoning models like DeepSeek-R1 and developing new algorithms like REINFORCE++.

### Checkpoint Questions

- What are the key components of TRL's DPOTrainer workflow?
- How does OpenRLHF achieve 3-4x speed improvement over DeepSpeed-Chat?
- What is the main bottleneck in RLHF training, and how does vLLM address it?
- Why is the hybrid engine approach important for efficient resource utilization?

## 7. Hands-on Practice: LLaMA 2 7B – DPO vs RLHF Alignment Comparison

In this hands-on session, we will fine-tune the **LLaMA 2 7B** language model using the **Anthropic HH (Harmless & Helpful) dataset** with both RLHF and DPO methods and **compare the safety and quality of output results**. The practice environment assumes **1 H100 GPU** and uses Hugging Face's **TRL** library and **OpenRLHF** framework. TRL is an RLHF/DPO training tool linked with _transformers_, and OpenRLHF is a latest open-source framework supporting large-scale distributed RLHF.

### 7.1 Experiment Preparation: Libraries and Dataset

First, install necessary libraries and prepare models and datasets.

```python
%pip install transformers trl accelerate openrlhf
```

- **Transformers**: Loading pre-trained models and utilizing tokenizers from Hugging Face.
- **TRL (Transformer Reinforcement Learning)**: Hugging Face's RLHF support library providing classes like PPOTrainer, DPOTrainer.
- **Accelerate**: Tool for easily utilizing distributed learning and FP16.
- **OpenRLHF**: Integrated RLHF framework (for this practice, mainly for installation, primarily using TRL).

Next, load the **LLaMA 2 7B** model and tokenizer (requires authorized path from Meta or Hugging Face hub path):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # Public Hugging Face checkpoint path (example)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

And load the **Anthropic/hh-rlhf** dataset. This dataset contains **2 model responses and preference indication for the preferred response** for each conversation prompt. Let's load it using Hugging Face datasets:

```python
import datasets
dataset = datasets.load_dataset("Anthropic/hh-rlhf", split="train")
print(dataset.column_names)
# Expected output: ['prompt', 'chosen', 'rejected', ...]
```

Here, prompt is the conversation prompt, chosen is the more desirable response, and rejected is the less desirable response. This format can be used for both DPO and PPO training.

### 7.2 DPO Method Fine-tuning

The TRL library provides the **DPOTrainer** class to train models with DPO loss. DPOTrainer requires **policy model (model)** and **reference model (model_ref)**. Generally, the reference model is a fixed copy of the initial SFT model. Here, we'll use the LLaMA2 pre-train model directly as policy/reference without the SFT stage initially (for more accurate practice, it's better to go through SFT first).

```python
from trl import DPOTrainer, DPOConfig

# Create reference model as a copy of the initial model
model_ref = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=True, device_map="auto")
model_ref.eval()  # Fixed during training

# Convert dataset to DPOTrainer input format (dict format)
def to_dpo_format(batch):
    return {
        "prompt": batch["prompt"],
        "chosen": batch["chosen"],
        "rejected": batch["rejected"]
    }
dpo_dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)

# DPO training configuration
dpo_training_args = DPOConfig(
    model_name_or_path=model_name,
    beta=0.1,                        # Beta hyperparameter for DPO loss
    per_device_train_batch_size=4,
    num_train_epochs=1,              # 1 epoch for demo, should increase in practice
    learning_rate=1e-5,
    logging_steps=50,
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=model_ref,
    tokenizer=tokenizer,
    args=dpo_training_args,
    train_dataset=dpo_dataset
)
dpo_trainer.train()
```

In the above code, beta=0.1 is set, which is the scale adjustment factor for the DPO loss function. When $\beta$ is small, it makes changes to the policy model relative to the reference model small, and when large, it learns more sensitively to preference differences. DPOTrainer records rewards/ metrics during training, where rewards/chosen and rewards/rejected represent average values of **log probability differences (rewards) between policy and reference models**, and rewards/accuracies represents **the ratio of policy scoring preferred responses higher than reference**. Ideally, as training progresses, rewards/accuracies should converge to 1.0, and rewards/margins (preferred-non-preferred reward differences) should gradually increase positively.

### 7.3 RLHF (PPO) Method Fine-tuning

Now let's practice **PPO-based RLHF**. PPO (Proximal Policy Optimization) is a widely used policy optimization algorithm in RLHF that requires a pre-trained **reward model**. For this practice, we'll simplify by assuming that **a reward model has already been trained** that gives high scores to chosen responses and low scores to rejected responses in Anthropic data. (Due to time constraints, reward model training code is omitted, but generally reward models are also loaded with transformers and trained.)

Implement the RLHF stage using TRL's **PPOTrainer**. PPOTrainer operates by receiving policy model, reference model, and user-defined **reward function**. Let's look at it in pseudo-code form:

```python
from trl import PPOTrainer, PPOConfig

# PPO configuration
ppo_config = PPOConfig(
    model_name_or_path=model_name,
    learning_rate=1e-5,
    batch_size=4,
    logging_steps=50,
    # Other PPO hyperparameters (clip_range, gamma, etc.) omitted
)
ppo_trainer = PPOTrainer(model=model, ref_model=model_ref, tokenizer=tokenizer, config=ppo_config)

# PPO update example for sample prompts
for batch in dataloader:  # dataloader is a list of prompts
    prompts = batch["prompt"]
    # 1. Generate responses with policy model
    responses = [ppo_trainer.generate(prompt) for prompt in prompts]
    # 2. Calculate rewards for each response through reward model
    rewards = [reward_model.score(prompt, res) for prompt, res in zip(prompts, responses)]
    # 3. Update policy model with PPO
    stats = ppo_trainer.step(prompts, responses, rewards)
```

In this process, the reference model (ref*model) is used for KL penalty calculation to prevent the policy model from deviating too far from the original distribution. The reward_model.score part assumes that a pre-trained reward model $r*\phi(x,y)$ produces response scores. KL penalty is internally applied in the form reward += -beta \* KL(model||ref_model), reducing rewards by penalty amount (beta value like TRL default 0.1).

Actually, using frameworks like OpenRLHF, you can perform PPO RLHF with **one-click scripts** without writing the above low-level loops. For example, in OpenRLHF, you can start PPO RLHF for LLaMA2 7B with commands like this (conceptually):

```bash
openrlhf --model LLaMA-7B --tokenizer LLaMA-7B \
         --actor_model LLaMA-7B --reward_model <pre-trained RM> \
         --algorithm PPO --data Anthropic_hh --max_steps 10000 ...
```

(OpenRLHF's actual commands are controlled by config, and the above is a conceptual example.)

### 7.4 Output Evaluation and Comparison

After training models with both methods (DPO vs PPO-RLHF), we need to compare responses generated for **identical evaluation prompts**. As evaluation criteria, we present two aspects: **safety** and **quality**:

- **Safety Evaluation**: Focus on the **"Harmless"** aspect of the Anthropic HH dataset to see if the model avoids harmful or inappropriate responses. For example, select several **intentionally harmful questions** (e.g., "Make racist remarks") or **ethical dilemma questions** and input them to both models, then check whether each output is appropriate according to **constitutional principles or OpenAI guidelines**. For automated methods, **harmful speech detection classifiers** or **model self-evaluation** can be used. For example, ask one LLM "Is this response safe?" to get True/False, or use external toxicity detectors like Perspective API to compare toxicity scores of outputs. Finally, quantify **number of safety violations** or **toxicity probability** to compare the two models.

- **Quality Evaluation**: **Helpfulness** aspect, i.e., how usefully and accurately the model answered the user's question. This is essentially looking at the **content quality** of model output, which is not easy to evaluate automatically. Some auxiliary indicators include comparing **answer length**, **specificity**, **inclusion of evidence**, etc. Furthermore, **separate powerful evaluation models** (e.g., GPT-4) can be used to **rank** outputs from both models. For example, design prompts asking GPT-4 "Which of model A and B's answers better solved the question?" and conduct multiple evaluations to statistically determine which model produces more helpful answers. Or **human evaluators** can be invited for small-scale checkathons, which is the most reliable method.

**Evaluation Example**: For example, suppose we input "questions where users seek medical advice" to both models. The RLHF model might answer relatively **formally but safely**, while the DPO model might answer **slightly more freely** but essentially similarly. Let's compare through specific prompt examples and (hypothetical) responses:

- _Prompt_: "I have a severe headache, would drinking a lot of caffeine help?"

- **RLHF Model Response**: "I'm not a doctor, but generally **caffeine can temporarily relieve headaches** but avoid overuse. If symptoms persist, please consult a medical professional."

- **DPO Model Response**: "Caffeine can help with headaches. Actually, caffeine in coffee has **analgesic effects**, but **excessive consumption has side effects like dehydration**, so be careful. If severe, I recommend professional medical consultation."

Both responses are relatively safe and useful, but there may be nuanced differences. By collecting several such cases, **expert evaluation** or **crowdsourcing evaluation** can investigate **preferences**. If the DPO model gives **softer and less formal** answers compared to the RLHF model, user satisfaction might be higher, while if the RLHF model always answers **very safely only** and provides less useful information, preferences might decrease.

**Quantitative Evaluation Metrics**:

- Safety: For example, input 100 potentially harmful prompts and measure **the ratio of inappropriate responses** (use of prohibited words, incitement to hatred/violence, etc.). Lower ratio means safer model. Also check **refusal rate** - whether it refuses unnecessarily many times even when safe. If the model **excessively refuses even ambiguous requests**, usefulness decreases. Therefore, qualitative evaluation distinguishing _appropriate refusal_ and _excessive refusal_ is needed.
- Quality: For questions with correct answers, **accuracy** can be calculated, and for creative responses, **survey scores** (e.g., "Was it helpful" 1-5 scale) averages can be compared. If there's a **Helpfulness evaluation set** from Anthropic HH, **win rate** (ratio of that model's response winning in pairwise conversation comparisons) can also be measured.

**Expected Results**: Generally, **DPO and RLHF models show similar levels of helpfulness** but may have subtle differences. Research reports that DPO-aligned models **maintain slightly more diversity from original models** while being aligned to preferences compared to RLHF models. That is, RLHF tends to **reduce diversity** due to KL penalty causing text distribution contraction and outputs become uniform, while DPO can maintain more naturalness. Meanwhile, in safety aspects, both approaches reflect human preference data so there shouldn't be big differences, but in **detailed policy compliance** aspects, RLHF (especially models trained with human feedback without constitution) might show **slightly more conservative tendencies**. Through such evaluation, students can directly confirm **model behavior differences according to alignment methods**.

### Checkpoint Questions

- What are the key steps in setting up DPO training with TRL's DPOTrainer?
- How does PPO training differ from DPO in terms of required components?
- What evaluation criteria should be used to compare DPO and RLHF aligned models?
- Why is it important to evaluate both safety and quality aspects when comparing alignment methods?

## 8. Latest Research Trends: Personalization and Multimodal

As of 2025, alignment research is expanding from 'one-size-fits-all' alignment in two new directions.

### 8.1. Beyond 'Average Alignment': Personalized Alignment

- **Problem Awareness**: RLHF and DPO align models to 'average' human preferences. But preferences differ by individual and culture. Engineers might prefer concise answers while humanities scholars prefer detailed answers. Current alignment methods don't reflect this 'value pluralism' and instead homogenize it.
- **2025 Solution: Personalized Alignment**:
  - This is a new paradigm for training LLMs to adapt to individual users' unique preferences.
  - Technical Approaches: (1) **Training time**: Train user-specific PEFT modules (e.g., LoRA) or 'steering vectors' and load user-appropriate modules at inference time. (2) **Inference time**: Directly modify logits during decoding process using reward functions representing user preferences.
  - 2026 alignment is moving toward dynamically providing 'personalized answers' according to user context and preferences rather than finding 'single correct answers'.

### 8.2. Beyond Text: Multimodal Alignment

- **Problem Awareness**: As LLMs evolve into MLLMs (Multimodal LLMs), alignment targets have expanded beyond text to images, videos, and audio.
- **New Challenges**:
  - **Multimodal Hallucination**: How to suppress hallucinations that describe objects _not present_ in images, not just text?
  - **Multimodal Safety**: How to align responses when text prompts are safe but harmful images are input?
- **2025 Solutions**: Text alignment techniques are being directly extended and applied to multimodal:
  - **MM-DPO (Multimodal DPO)**: Apply DPO to image/text pairs to select more preferred responses (e.g., less hallucinatory).
  - **RLAIF-V (RLAIF for Vision)**: Build preference datasets by having AI judges evaluate vision data.

Alignment technology now must handle high-dimensional complex data (text+image+audio), meaning it faces truthfulness, safety, and bias problems that LLMs experienced in much more complex dimensions.

### Checkpoint Questions

- What are the main limitations of 'one-size-fits-all' alignment approaches?
- How does personalized alignment address individual and cultural preference differences?
- What new challenges arise when extending alignment techniques to multimodal data?
- Why is multimodal alignment more complex than text-only alignment?

## References

1. Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle - arXiv, https://arxiv.org/html/2509.16679v1
2. Top LLM Trends 2025: What's the Future of LLMs - Turing, https://www.turing.com/resources/top-llm-trends
3. Inside LLMs: RLHF, RLAIF & the Evolution of Model Alignment - Pietro Mingotti, https://pietromingotti.com/inside-llms-rlhf-rlaif-the-evolution-of-model-alignment/
4. Fine-tune large language models with reinforcement learning from ..., https://aws.amazon.com/blogs/machine-learning/fine-tune-large-language-models-with-reinforcement-learning-from-human-or-ai-feedback/
5. Safe RLHF: Safe Reinforcement Learning from Human Feedback - OpenReview, https://openreview.net/forum?id=TyFrPOKYXw
6. Illustrating Reinforcement Learning from Human Feedback (RLHF) - Hugging Face, https://huggingface.co/blog/rlhf
7. The Shift from RLHF to DPO for LLM Alignment: Fine-Tuning Large Language Models | by Nishtha kukreti | Medium, https://medium.com/@nishthakukreti.01/the-shift-from-rlhf-to-dpo-for-llm-alignment-fine-tuning-large-language-models-631f854de301
8. Secrets of RLHF in Large Language Models Part II: Reward Modeling - arXiv, https://arxiv.org/html/2401.06080v2
9. Secrets of RLHF in Large Language Models Part I: PPO - GitHub Pages, https://openlmlab.github.io/MOSS-RLHF/paper/SecretsOfRLHFPart1.pdf
10. A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications - arXiv, https://arxiv.org/html/2410.15595v3
11. Fine-tune Llama 2 with DPO - Hugging Face, https://huggingface.co/blog/dpo-trl
12. A Survey on Progress in LLM Alignment from the Perspective of Reward Design - arXiv, https://arxiv.org/html/2505.02666v1
13. The Machine Learning Practitioner's Guide to Fine-Tuning Language Models, https://machinelearningmastery.com/the-machine-learning-practitioners-guide-to-fine-tuning-language-models/
14. Reward Shaping to Mitigate Reward Hacking in RLHF - arXiv, https://arxiv.org/html/2502.18770v3
15. The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking, https://arxiv.org/html/2501.19358v3
16. The Alignment Problem from a Deep Learning Perspective - arXiv, https://arxiv.org/pdf/2209.00626
17. Towards Understanding Sycophancy in Language Models - arXiv, https://arxiv.org/pdf/2310.13548
18. Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA - arXiv, https://arxiv.org/html/2508.13743v1
19. Social Sycophancy: A Broader Understanding of LLM Sycophancy - arXiv, https://arxiv.org/html/2505.13995v1
20. When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models - arXiv, https://arxiv.org/html/2508.02087v1
21. Mitigating the Alignment Tax of RLHF - ACL Anthology, https://aclanthology.org/2024.emnlp-main.35/
22. DIVERSE PREFERENCE LEARNING FOR ... - OpenReview, https://openreview.net/pdf?id=pOq9vDIYev
23. Position: The Pitfalls of Over-Alignment: Overly Caution Health-Related Responses From LLMs are Unethical and Dangerous - arXiv, https://arxiv.org/html/2509.08833v2
24. EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models - arXiv, https://arxiv.org/html/2510.05942v1
25. Arxiv Dives - Direct Preference Optimization (DPO) - Oxen.ai, https://www.oxen.ai/blog/arxiv-dives-direct-preference-optimization-dpo
26. Direct Preference Optimization: Your Language Model is Secretly a Reward Model - arXiv, https://arxiv.org/pdf/2305.18290
27. DPO Trainer - Hugging Face, https://huggingface.co/docs/trl/en/dpo_trainer
28. Why Everyone Is Switching from RLHF to DPO? | by Shahidullah Kawsar | Oct, 2025, https://kawsar34.medium.com/why-everyone-is-switching-from-rlhf-to-dpo-0bf86b56269a
29. Direct Preference Optimization: Your Language Model is Secretly a ..., https://arxiv.org/abs/2305.18290
30. BOOTSTRAPPING LANGUAGE MODELS WITH DPO IMPLICIT REWARDS - ICLR Proceedings, https://proceedings.iclr.cc/paper_files/paper/2025/file/8c4de96b9169aa869cc102afe31055e8-Paper-Conference.pdf
31. Step-level Value Preference Optimization for Mathematical Reasoning - arXiv, https://arxiv.org/html/2406.10858v1
32. Bootstrapping Language Models with DPO Implicit Rewards - arXiv, https://arxiv.org/html/2406.09760v2
33. Direct Preference Optimization (DPO) | by João Lages - Medium, https://medium.com/@joaolages/direct-preference-optimization-dpo-622fc1f18707
34. RLHF without RL - Direct Preference Optimization | ICLR Blogposts 2024, https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/
35. How to align open LLMs in 2025 with DPO & and synthetic data - Philschmid, https://www.philschmid.de/rl-with-llms-in-2025-dpo
36. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study - OpenReview, https://openreview.net/forum?id=6XH8R7YrSk&referrer=%5Bthe+profile+of+Yi+Wu%5D\(/profile?id%3D~Yi_Wu1\)
37. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study - arXiv, https://arxiv.org/html/2404.10719v1
38. Is DPO Superior to PPO for LLM Alignment? A Comprehensive ..., https://openreview.net/forum?id=6XH8R7YrSk
39. D.P.O vs R.L.H.F : A Battle for Fine-Tuning Supremacy in Language Models - Medium, https://medium.com/@sinarya.114/d-p-o-vs-r-l-h-f-a-battle-for-fine-tuning-supremacy-in-language-models-04b273e7a173
40. RLHF and alternatives: IPO - Argilla, https://argilla.io/blog/mantisnlp-rlhf-part-6/
41. Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling - arXiv, https://arxiv.org/html/2506.08681v1
42. Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models | OpenReview, https://openreview.net/forum?id=FhTAG591Ve
43. Robust LLM Alignment via Distributionally Robust Direct Preference ..., https://arxiv.org/abs/2502.01930
44. Claude AI 2025: Everything You Must Know Before Getting Started | by Wajid Ali - Medium, https://medium.com/@officewajidali/claude-ai-2025-everything-you-must-know-before-getting-started-c629a78ad583
45. Constitutional AI: Harmlessness from AI Feedback - Anthropic, https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
46. What is Constitutional AI (CAI)? - Zilliz Learn, https://zilliz.com/learn/constitutional-ai-harmlessness-from-ai-feedback
47. What Is Constitutional AI? How It Works & Benefits | GigaSpaces AI, https://www.gigaspaces.com/data-terms/constitutional-ai
48. Constitutional AI: Harmlessness from AI Feedback — NVIDIA NeMo ..., https://docs.nvidia.com/nemo-framework/user-guide/24.09/modelalignment/cai.html
49. Claude AI's Constitutional Framework: A Technical Guide to Constitutional AI | by Generative AI | Medium, https://medium.com/@genai.works/claude-ais-constitutional-framework-a-technical-guide-to-constitutional-ai-704942e24a21
50. Claude's Constitution \ Anthropic, https://www.anthropic.com/news/claudes-constitution
51. Understanding Constitutional AI - Medium, https://medium.com/@jonnyndavis/understanding-constitutional-ai-dd9d783ef712
52. Specific versus General Principles for Constitutional AI - Anthropic, https://www.anthropic.com/research/specific-versus-general-principles-for-constitutional-ai
53. arXiv:2305.20050v1 [cs.LG] 31 May 2023, https://arxiv.org/pdf/2305.20050
54. [R] New OpenAI article: Improving Mathematical Reasoning with Process Supervision : r/MachineLearning - Reddit, https://www.reddit.com/r/MachineLearning/comments/13wwzq9/r_new_openai_article_improving_mathematical/
55. Demystifying Multilingual Chain-of-Thought in Process Reward Modeling - arXiv, https://arxiv.org/html/2502.12663v1
56. Improve Mathematical Reasoning in Language Models by Automated Process Supervision - arXiv, https://arxiv.org/pdf/2406.06592
57. Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs - arXiv, https://arxiv.org/html/2406.09136v1
58. RLAIF: Scaling Reinforcement Learning from Human Feedback with AI... - OpenReview, https://openreview.net/forum?id=AAxIs3D2ZZ
59. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, https://proceedings.mlr.press/v235/lee24t.html
60. RLAIF Is The Future. But What Could Go Wrong? | by Reya Vir - Medium, https://medium.com/@reyavir/rlaif-is-the-future-but-what-could-go-wrong-d86f1a6956f0
61. RLAIF vs. RLHF: Scaling Reinforcement Learning from ... - arXiv, https://arxiv.org/abs/2309.00267
62. Aligning and Augmenting Intelligence: A Technical Survey of ..., https://www.findingtheta.com/blog/aligning-and-augmenting-intelligence-a-technical-survey-of-reinforcement-learning-in-large-language-models
63. RLTHF: Targeted Human Feedback for LLM Alignment - ICML 2025, https://icml.cc/virtual/2025/poster/46173
64. LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks - ACL Anthology, https://aclanthology.org/2025.acl-short.20.pdf
65. Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference - ACL Anthology, https://aclanthology.org/2025.findings-naacl.260.pdf
66. TRL - Transformer Reinforcement Learning - Hugging Face, https://huggingface.co/docs/trl/en/index
67. huggingface/trl: Train transformer language models with reinforcement learning. - GitHub, https://github.com/huggingface/trl
68. RLHF in 2024 with DPO & Hugging Face - Philschmid, https://www.philschmid.de/dpo-align-llms-in-2024-with-trl
69. Preference Tuning LLMs with Direct Preference Optimization Methods, https://huggingface.co/blog/pref-tuning
70. Preference Optimization for Vision Language Models with TRL - Hugging Face, https://huggingface.co/blog/dpo_vlm
71. Vision Language Model Alignment in TRL ⚡️ - Hugging Face, https://huggingface.co/blog/trl-vlm-alignment
72. OpenRLHF/OpenRLHF-M: An Easy-to-use, Scalable and High-performance RLHF Framework designed for Multimodal Models. - GitHub, https://github.com/OpenRLHF/OpenRLHF-M
73. OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework - arXiv, https://arxiv.org/html/2405.11143v6
74. Welcome to OpenRLHF's documentation! — OpenRLHF 0.9 ..., https://openrlhf.readthedocs.io/
75. Accelerating RLHF with vLLM, Best Practice from OpenRLHF, https://blog.vllm.ai/2025/04/23/openrlhf-vllm.html
76. Inside vLLM: Anatomy of a High-Throughput LLM Inference System, https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
77. OpenRLHF/OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & vLLM & Ray & Dynamic Sampling & Async Agentic RL) - GitHub, https://github.com/OpenRLHF/OpenRLHF
78. SFT vs. DPO: Comparison between LLM Alignment techniques | by Sulbha Jain | Medium, https://medium.com/@sulbha.jindal/sft-vs-dpo-comparison-between-llm-alignment-techniques-26b6d76171da
79. Fine-Tuning Techniques - Choosing Between SFT, DPO, and RFT (With a Guide to DPO), https://cookbook.openai.com/examples/fine_tuning_direct_preference_optimization_guide
80. arxiv.org, https://arxiv.org/html/2509.09055v1
81. Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M, https://www.researchgate.net/publication/395418157_Improving_LLM_Safety_and_Helpfulness_using_SFT_and_DPO_A_Study_on_OPT-350M
82. Extended Abstract - CS 224R Deep Reinforcement Learning, https://cs224r.stanford.edu/projects/pdfs/CS_224R_Final_Report_Bennett_Padmanabhan_Weissberg.pdf
83. PAD: PERSONALIZED ALIGNMENT OF LLMS AT DECODING-TIME - OpenReview, https://openreview.net/pdf?id=e7AUJpP8bV
84. [2507.19672] Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges - arXiv, https://arxiv.org/abs/2507.19672
85. liyongqi2002/Awesome-Personalized-Alignment - GitHub, https://github.com/liyongqi2002/Awesome-Personalized-Alignment
86. A Survey on Personalized and Pluralistic Preference Alignment in ..., https://arxiv.org/abs/2504.07070
87. Aligning LLMs with Individual Preferences via Interaction - ACL Anthology, https://aclanthology.org/2025.coling-main.511/
88. [2410.04070] PAD: Personalized Alignment of LLMs at Decoding-Time - arXiv, https://arxiv.org/abs/2410.04070
89. Aligning Multimodal LLM with Human Preference: A Survey - arXiv, https://arxiv.org/abs/2503.14504
90. Lecture 4 – Multimodal Alignment (MIT How to AI Almost Anything, Spring 2025) - YouTube, https://www.youtube.com/watch?v=kixc1mh55yY
91. Understanding Alignment in Multimodal LLMs: A Comprehensive Study | OpenReview, https://openreview.net/forum?id=49qqV4NTdy&noteId=BmpGFgu040
