Feature-Based vs. GAN-Based Imitation: When and Why

May 14, 2025 · Chenhao Li · 11 min read

The field of using offline, state-based reference data to inform reinforcement learning (RL) lacks a universally agreed-upon term. Phrases like imitation learning, learning from demonstrations, and demonstration learning are frequently overloaded and used inconsistently across the literature. In this post, I’ll refer to this family of techniques simply as learning from demonstrations—specifically meaning methods that use offline, state-based reference data to compute a reward signal quantifying the similarity between policy behavior and reference behavior.

Why Learn from Demonstrations?

Before diving into specific methods, I want to lay out a key motivation for learning from demonstrations. Many emphasize that reference motions tend to look more natural or visually pleasing—but this often overlooks the substantial benefits they bring to learning efficiency. Reference data is frequently treated as a form of regularization rather than as a true learning signal. In reality, reference motions act as crucial guidance, biasing the policy’s exploration toward meaningful behaviors. This was less critical several years ago, when our agents operated in lower-dimensional spaces (e.g., quadrupeds) and meticulous reward shaping or curriculum design could still reliably coax out complex behaviors. However, as recent progress in humanoid control has shown, this approach no longer scales. As the dimensionality of the action and state spaces grows, the cost of manually crafting exploration incentives becomes prohibitive—and exploration often fails to discover viable behaviors at all. In this light, reference motions are not mere constraints—they are essential sources of learning signal.

Why This Post?

This post stems from observing that practitioners often adopt one method over another without rigorous justification, attributing success to the method itself without considering that similar results could be achieved with alternatives—or without fully understanding why one works where the other does not. I’ll walk through these methods in roughly chronological order, then compare them based on their algorithmic characteristics.

Feature-Based Beginnings: DeepMimic and Its Legacy

Feature-based approaches trace back to DeepMimic [1], which now feels intuitive in hindsight. These methods align the policy with reference motions by introducing a time-index (or phase) variable and computing feature-wise distances between the policy and reference trajectories at synchronized time steps. With a strong regression signal, such methods excel at reproducing fine-grained motion details. However, DeepMimic has significant limitations when handling diverse reference motions. The proposed one-hot motion encoding enables the inclusion of multiple motion clips but fails to model the relationships between those motions. While temporal consistency is handled via the phase index, spatial coherence across different motions is not. Transitioning between motions via hard switches, without modeling their relationships, often leads to jarring discontinuities. What is missing is a structured motion representation space. A well-designed representation not only enables smoother transitions but also improves generalization: policies trained over such structured and expressive motion spaces are more likely to generalize to novel behaviors beyond those seen in the reference dataset.
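
To make this concrete, here is a minimal sketch of a phase-indexed tracking reward in this spirit; the `joint_pos` and `joint_vel` features, weights, and kernel scales are illustrative placeholders rather than DeepMimic's exact formulation.

```python
import numpy as np

def tracking_reward(policy_state: dict, reference_clip: dict, phase: float) -> float:
    """Phase-indexed imitation reward in the spirit of DeepMimic (sketch).

    The phase in [0, 1) selects which reference frame the policy is compared
    against; feature-wise errors are mapped to bounded rewards with exponential
    kernels. Weights and scales are illustrative, not the paper's exact values.
    """
    num_frames = reference_clip["joint_pos"].shape[0]
    idx = min(int(phase * num_frames), num_frames - 1)  # synchronized time step

    # Feature-wise squared errors against the phase-aligned reference frame.
    pose_err = np.sum((policy_state["joint_pos"] - reference_clip["joint_pos"][idx]) ** 2)
    vel_err = np.sum((policy_state["joint_vel"] - reference_clip["joint_vel"][idx]) ** 2)

    # Exponential kernels keep each term in (0, 1]; the weights trade off
    # pose precision against velocity tracking.
    return 0.7 * float(np.exp(-2.0 * pose_err)) + 0.3 * float(np.exp(-0.1 * vel_err))
```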

Scaling with Diversity: The Rise of GAN-Based Methods

To address DeepMimic’s limitations in handling diverse reference motions, AMP [2] introduced adversarial training. In this framework, a discriminator is trained to differentiate between state transitions generated by the policy and those in the reference dataset. When the discriminator can no longer tell them apart, the policy has produced transitions that are indistinguishable from the reference. GAN-based methods treat the policy as a generator and optimize the saddle-point problem inherent to adversarial learning. These methods scale naturally with diverse motion data, as they operate on short transition fragments rather than full trajectories—eliminating the need for explicit time alignment. Moreover, the discriminator implicitly induces a similarity space, in which transitions that are behaviorally alike produce similar outputs. This captures inter-motion relationships better than methods relying on phase synchronization alone. As a result, policies trained with GAN-based objectives often exhibit smoother transitions between motions—especially compared to early methods like DeepMimic, which rely on hard switching between clips. However, since the discriminator provides a weaker reward signal than full-feature regression, these methods typically adapt better to the inclusion of additional task-specific rewards. AMP-based techniques have demonstrated effectiveness in both character animation (e.g., PHC [3]) and robotics (e.g., Alejandro [4]). Yet adversarial training introduces two major challenges beyond general training instability:
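
A minimal sketch of this setup, using a plain binary cross-entropy objective (AMP itself uses a least-squares variant with additional regularization), might look as follows; the network sizes and the reward transform are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TransitionDiscriminator(nn.Module):
    """Classifies state transitions (s, s') as reference vs. policy-generated."""

    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # Operates on short transition fragments, so no time alignment is needed.
        return self.net(torch.cat([s, s_next], dim=-1))


def discriminator_loss(disc, ref_s, ref_s_next, pol_s, pol_s_next):
    """Binary cross-entropy: reference transitions are real, policy ones are fake."""
    bce = nn.BCEWithLogitsLoss()
    ref_logits = disc(ref_s, ref_s_next)
    pol_logits = disc(pol_s, pol_s_next)
    return bce(ref_logits, torch.ones_like(ref_logits)) + bce(pol_logits, torch.zeros_like(pol_logits))


def style_reward(disc, s, s_next):
    """Imitation reward: high when a policy transition looks like reference data."""
    with torch.no_grad():
        p_ref = torch.sigmoid(disc(s, s_next))
    return -torch.log(torch.clamp(1.0 - p_ref, min=1e-4))
```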

Discriminator Saturation: The discriminator can quickly become overconfident, especially when the policy’s behavior initially diverges significantly from the reference data. This results in vanishing gradients, leaving the policy with no meaningful learning signal. The issue is particularly acute in complex settings like rough terrain or object manipulation, where early-stage policy rollouts may bear little resemblance to any reference behavior. Solutions such as Wasserstein-based objectives (e.g., WASABI [5], HumanMimic [6]) aim to retain useful gradients even in the face of a strong discriminator; a minimal sketch of this idea follows after the next point.

Mode Collapse: The policy may converge to producing a narrow subset of behaviors that suffice to fool the discriminator, ignoring the broader diversity present in the reference data. This leads to a failure in recovering the full range of demonstrated motions. While the discriminator does implicitly respect spatial relationships between motions, AMP lacks an explicit motion representation that could be conditioned on to enable controlled diversity.
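
Returning to the saturation point above, here is a minimal sketch of a Wasserstein-style critic loss with a gradient penalty, in the spirit of (but not identical to) the objectives used by WASABI and HumanMimic; the `critic` network and the transition feature batches are assumed placeholders.

```python
import torch

def wasserstein_critic_loss(critic, ref_batch, pol_batch, gp_weight: float = 10.0):
    """Wasserstein-style critic loss with a gradient penalty (sketch).

    The score difference keeps providing gradients to the policy even when
    reference and policy transitions are easy to tell apart, which is the
    failure mode of a saturating binary classifier. `critic` maps a transition
    feature vector to a scalar score.
    """
    # Push scores up on reference transitions and down on policy transitions.
    loss = critic(pol_batch).mean() - critic(ref_batch).mean()

    # Gradient penalty on interpolated samples encourages approximate 1-Lipschitzness.
    alpha = torch.rand(ref_batch.size(0), 1, device=ref_batch.device)
    mixed = (alpha * ref_batch + (1.0 - alpha) * pol_batch).requires_grad_(True)
    grad = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
    penalty = ((grad.norm(2, dim=-1) - 1.0) ** 2).mean()

    return loss + gp_weight * penalty
```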

To tackle these limitations, subsequent works have introduced mechanisms for unsupervised skill discovery. Methods like CASSI [7], ASE [8], and CALM [9] learn latent encodings that separate different motions in embedding space. Others, like Multi-AMP [10], CASE [11], and SMPLOlympics [12], use annotated motion categories to condition the discriminator on motion type. Distillation-based approaches such as HuMoR [13] and PULSE [14] leverage a variational bottleneck to learn structured latent spaces from motion data.
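
A common ingredient in the conditioned variants above is to give the discriminator the skill label or latent code alongside the transition, so the policy must match the reference data of the requested skill rather than any reference motion. A minimal sketch, where the conditioning vector is assumed to be a one-hot category or a learned latent:

```python
import torch
import torch.nn as nn

class ConditionalTransitionDiscriminator(nn.Module):
    """Transition discriminator conditioned on a skill label or latent code (sketch).

    Conditioning forces the policy to match the reference data of the requested
    skill rather than any reference motion, which counteracts mode collapse.
    """

    def __init__(self, state_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, s_next, cond], dim=-1))
```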

Revisiting Feature-Based Methods: Representation Learning to the Rescue

Despite their flexibility with diverse data, GAN-based methods come at a substantial cost: tuning, stabilizing training, and mitigating discriminator saturation and mode collapse often demand significant engineering effort. Revisiting the core motivation—enabling smooth transitions and generalization across diverse motions—points back to the importance of a structured motion representation space. If such a latent space can be learned, transitions between motions can be performed smoothly in that space, leading to better generalization beyond the reference set. It also simplifies reward construction, which often reduces to weighted feature differences.

As a result, a new line of work has emerged: using feature-based methods enhanced with learned representations, avoiding adversarial training altogether. Some methods inject reference motions or key features directly into the policy (e.g., PhysHOI [15], ExBody [16], H2O [17], HumanPlus [18], MaskedMimic [19]). Others use policy-driven signals to learn encodings (VQ-PMC [20]), or apply self-supervised training to build temporally and spatially coherent motion embeddings (VMP [21], RobotMDM [22]). In particular, frequency-domain approaches (PAE [23], FLD [24], DFM [25]) introduce motion-inductive biases to capture meaningful temporal and spatial relationships. These methods can be viewed as generalized successors to DeepMimic: they align time across motions automatically, while also addressing motion similarity structurally rather than heuristically.
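
To illustrate how a learned representation changes the reward, here is a rough sketch in which imitation is scored by distance in the embedding space of a pretrained motion encoder; the encoder architecture and windowed input are placeholders for whichever representation method is actually used.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Maps a short window of states to a latent motion embedding (placeholder).

    Stands in for any learned representation (variational, frequency-domain,
    or self-supervised); assumed to be pretrained offline on the motion data.
    """

    def __init__(self, state_dim: int, window: int, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(state_dim * window, 256), nn.ELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, state_window: torch.Tensor) -> torch.Tensor:
        return self.net(state_window)


def latent_tracking_reward(encoder, policy_window, reference_window, scale: float = 1.0):
    # Reward is a simple kernel on the embedding distance, so hand-crafted
    # per-feature weighting collapses into a single scale parameter.
    with torch.no_grad():
        z_pi = encoder(policy_window)
        z_ref = encoder(reference_window)
    return torch.exp(-scale * torch.sum((z_pi - z_ref) ** 2, dim=-1))
```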

Strengths and Limitations

Now that we’ve covered both approaches, here’s a concise comparison of their respective strengths and limitations:

GAN-Based Methods

Strengths:

  • Inherently scalable to diverse motions: Operate on short transition snippets without requiring time alignment, making them naturally suitable for large, unstructured datasets.
  • Implicit structure through the discriminator: The adversarial setup learns a similarity space between transitions, which supports generalization across varied behaviors.
  • Flexible reward integration: The sparse, coarse signal from the discriminator combines well with task-specific rewards without over-constraining the policy.

Limitations:

  • Training instability: GANs are notoriously sensitive, often requiring delicate balancing between generator and discriminator.
  • Discriminator saturation: A strong discriminator too early in training can starve the policy of learning signal, especially in complex environments.
  • Mode collapse: The policy may overfit to a small subset of behaviors that easily fool the discriminator, leading to poor coverage of the reference data.
  • No explicit motion representation: Without conditioning on a learned latent space, transitions between behaviors lack structure and interpretability.

Feature-Based Methods

Strengths:

  • Strong learning signal: Frame-by-frame regression provides detailed supervision, making it easier to recover precise motion features.
  • Interpretability and control: Hand-crafted features and synchronized references offer fine-grained control over what is being imitated.
  • Latent conditioning enables diversity: When combined with a structured motion representation, these methods can support smooth transitions and broader generalization.

Limitations:

  • Requires careful inductive biases: Success hinges on engineering motion representations that respect temporal and spatial relationships—nontrivial in high-DOF systems.
  • Synchronization assumptions: Still reliant on some form of temporal alignment across motions unless a learned representation handles it implicitly.
  • Scaling with diversity is nontrivial: Although scalable in principle, representation learning and sampling strategies must be robust to handle very large or noisy datasets.

On Metrics and Misconceptions

Before moving on, it’s worth noting why I haven’t discussed terms like naturalness, energy efficiency, or cost of transport. These are often used to evaluate learned behaviors, but they don’t actually reflect how well a given algorithm performs. Improvements on these axes depend entirely on the quality of the reference data—not the algorithm itself. If a policy exhibits more natural motion or reduced energy usage, it’s a statement about the reference, not about whether GAN-based or feature-based learning is superior.

Debunking Common Beliefs

With that clarified, let’s address some common beliefs—many of them overstated or simply incorrect—about how GAN-based and feature-based methods compare:

GAN-based methods automatically develop a distance metric between reference and policy motions.

Yes, but that distance metric can be misleading. This is exactly what leads to discriminator saturation, where the discriminator consistently reports a large distance from the policy to the reference, regardless of actual improvement. It also explains mode collapse: a policy-generated motion is treated as close to all reference motions as long as it resembles just one of them.

GAN-based methods don’t require hand-crafted features.

No. The input to the discriminator must still be carefully chosen. Just like selecting features to compare policy and reference motions in feature-based methods, you need to decide which features the discriminator observes. If the selected features are insufficient, the policy will only imitate what’s visible. If too many are included, the discriminator becomes overly powerful and saturates quickly. This is especially problematic in settings like rough terrain locomotion or object manipulation, where including terrain or object features in the discriminator makes it nearly impossible for the policy to imitate early on.

GAN-based methods avoid hand-tuned reward weights for different features.

Not quite. While these methods don’t require explicit reward weighting, the features provided to the discriminator must still be scaled and normalized. This introduces an implicit form of weighting: the relative scales of input features directly affect the discriminator’s sensitivity to each dimension. So, although the weighting is no longer manual in the reward function, the problem reappears as a normalization issue in the discriminator input.
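
As a small illustration, a common remedy is to standardize the discriminator inputs with statistics computed from the reference dataset, so that no single feature's scale silently dominates; the tensors below are hypothetical.

```python
import torch

def normalize_disc_features(features: torch.Tensor,
                            ref_mean: torch.Tensor,
                            ref_std: torch.Tensor) -> torch.Tensor:
    """Standardize discriminator inputs feature-wise (sketch).

    Without this, a feature in large units (e.g. joint velocities in rad/s)
    implicitly outweighs one in small units (e.g. root height in meters),
    which is the hidden weighting described above.
    """
    return (features - ref_mean) / (ref_std + 1e-6)

# Statistics are typically computed once from the reference dataset:
# ref_mean = reference_features.mean(dim=0)
# ref_std = reference_features.std(dim=0)
```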

GAN-based methods yield smoother transitions between motions.

Only when compared to early methods like DeepMimic, which rely on hard switching and poorly modeled motion relationships. In contrast, feature-based methods that use structured motion representations can generate much smoother transitions. These representations encode temporal and spatial relationships between motions, and interpolation within the latent space produces continuous transitions—without mode collapse.

Only GAN-based methods can be combined with task rewards.

No, both methods can be combined with task rewards—and even serve as task rewards themselves. The main difference lies in the signal strength. Feature-based methods align reference and policy trajectories frame-by-frame, providing a dense and informative signal. GAN-based rewards are coarser, leading to better adaptability in some tasks, but typically lower fidelity to the reference motion. Adding more context (e.g., stacked frames) to the discriminator usually makes it stronger and more prone to saturation.

GAN-based methods deal better with unstructured or noisy reference motions.

Not really. While GAN-based methods tend to smooth out reference motions and thus disregard potential noise, they also lose important motion details. Feature-based methods can also handle noise—particularly during self-supervised pretraining—using variational priors to absorb unwanted variation in the reference data.

Feature-based methods generalize better to unseen motion inputs.

Not necessarily. Generalization depends on the quality of the motion representation and how the policy is trained. Constructing a latent space that meaningfully encodes motion similarity is difficult, especially due to the temporal nature of motion. Relying on spatial similarity alone fails to capture relationships between distant but related frames within a single motion.

Feature-based methods are easier to implement.

No. These methods require careful design of inductive biases to be effective. Even with a good representation, training brings challenges like latent collapse or poor disentanglement. Sampling from the learned space is also nontrivial: while random sampling increases coverage, it can produce infeasible or incoherent references that hurt policy training.

Feature-based methods scale better.

I would argue yes—if there existed a universal motion representation (a motion tokenizer, analogous to tokenizers for text) that encodes all possible behaviors in a latent space respecting both temporal and spatial relationships. Such a representation would allow policies to generalize more effectively when conditioned on it. While this remains an open challenge, it’s a promising direction for future work.

Final Thoughts

Many problems attributed to one class of methods often reappear in a different form in the other. Likewise, limitations in one method are not necessarily addressed by switching to another. While GAN-based methods often suffer from training instability and loss of motion detail, feature-based methods can be brittle due to the difficulty of designing appropriate inductive biases and representations. Choosing between them is not a matter of which is “better” in general, but which is more aligned with the constraints and priorities of your task—whether that’s fidelity, diversity, generalization, or training simplicity.
