Gaussian Glasses: Do Diffusion Priors Help Classification Under Corruption and Camouflage?

Hector Astrom | Kevin Zhu | Luc Gaitskell

Final project for 6.7960, MIT
Example of overfitting in an RL-trained model
Our ImageDDPO implementation can be installed with `pip install imageddpo`.
In this work, we adopt conditional denoising diffusion probabilistic models (DDPMs), which define a distribution \( p(x_0 | c) \) over data samples \( x_0 \) given an associated context \( c \). These models operate by introducing a forward noising process \( q(x_t | x_{t-1}) \), a Markov chain that incrementally corrupts the data with noise across many timesteps. The learning task is to approximate the reverse of this corruption process.
To achieve this, a neural network \( \mu_\theta(x_t, c, t) \) is trained to predict the posterior mean of the forward process at each timestep. Training proceeds by sampling a data pair \( (x_0, c) \), selecting a timestep \( t \) uniformly, and generating a noisy latent \( x_t \) using the forward diffusion kernel. The model minimizes the difference between its prediction and the true posterior mean \( \tilde{\mu}(x_0, t) \):
$$ L_{\text{DDPM}}(\theta) = \mathbb{E} \left[ \| \tilde{\mu}(x_0, t) - \mu_\theta(x_t, c, t) \|^2 \right]. $$

This objective essentially maximizes a variational lower bound on the data log-likelihood, which grounds the procedure in a principled generative modeling framework. Because the forward process is fixed and analytically tractable, the model's task is to estimate the reverse denoising dynamics.
At sampling time, generation begins from Gaussian noise \( x_T \sim \mathcal{N}(0, I) \). The model then iteratively applies the learned reverse transitions \( p_\theta(x_{t-1} | x_t, c) \), gradually removing noise and steering the sample toward the data distribution. Most widely used samplers implement this reverse update as an isotropic Gaussian with a timestep-dependent variance:
$$ p_\theta(x_{t-1} | x_t, c) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, c, t), \sigma_t^2 I). $$

By chaining these denoising steps together, diffusion models generate high-quality samples along an iterative noise-to-data trajectory that is fully guided by the learned model.
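To make the training and sampling recipes above concrete, here is a minimal PyTorch sketch of both, assuming a network `model(x_t, c, t)` that predicts the posterior mean and pre-computed schedule tensors `alphas`, `alphas_bar`, `betas`, and `sigmas`; all names and shapes are illustrative rather than our actual implementation.

```python
import torch

def ddpm_loss(model, x0, c, alphas, alphas_bar, betas):
    """One training step: noise x0 to x_t, then regress the true posterior mean."""
    B, T = x0.shape[0], betas.shape[0]
    t = torch.randint(1, T, (B,))                        # uniform timestep
    ab_t = alphas_bar[t].view(B, 1, 1, 1)
    ab_prev = alphas_bar[t - 1].view(B, 1, 1, 1)
    a_t, b_t = alphas[t].view(B, 1, 1, 1), betas[t].view(B, 1, 1, 1)

    eps = torch.randn_like(x0)
    x_t = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps     # forward kernel q(x_t | x_0)

    # True posterior mean of q(x_{t-1} | x_t, x_0), the regression target.
    mu_tilde = (ab_prev.sqrt() * b_t * x0 + a_t.sqrt() * (1 - ab_prev) * x_t) / (1 - ab_t)
    return ((mu_tilde - model(x_t, c, t)) ** 2).mean()   # L_DDPM

@torch.no_grad()
def ddpm_sample(model, c, shape, sigmas, T):
    """Reverse process: start from pure noise and chain learned denoising steps."""
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(1, T)):
        tt = torch.full((shape[0],), t)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = model(x, c, tt) + sigmas[t] * noise          # p_theta(x_{t-1} | x_t, c)
    return x
```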
We use the `DDPOTrainer` class to implement this learning algorithm. At a high level, it works as follows:
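The sketch below mirrors that high-level loop. The callables `rollout_fn`, `log_prob_fn`, and `reward_fn` are hypothetical stand-ins for pipeline internals, not the actual `DDPOTrainer` API, which additionally handles batching, gradient accumulation, and advantage bookkeeping.

```python
import torch

def ddpo_iteration(rollout_fn, log_prob_fn, reward_fn, optimizer,
                   num_inner_epochs=1, clip_range=1e-4):
    """One DDPO iteration (schematic). Each denoising step is an action in an
    MDP, so PPO's clipped surrogate is applied to per-step log-prob ratios."""
    # 1. Rollout: sample trajectories without gradients, recording the
    #    per-step log-probs under the current policy.
    with torch.no_grad():
        images, trajectories, old_log_probs = rollout_fn()   # log_probs: (B, S)
        # 2. Score the final images and normalize rewards into advantages.
        rewards = reward_fn(images)                          # (B,)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 3. PPO update: recompute log-probs with gradients and take clipped steps.
    for _ in range(num_inner_epochs):
        new_log_probs = log_prob_fn(trajectories)            # (B, S)
        ratio = (new_log_probs - old_log_probs).exp()
        unclipped = ratio * advantages[:, None]
        clipped = ratio.clamp(1 - clip_range, 1 + clip_range) * advantages[:, None]
        loss = -torch.min(unclipped, clipped).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```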
COD10K Sample Image (deer)
CIFAR-10-C Sample Image (Truck corrupted with Gaussian noise)
| CIFAR-10-C Corruption Type (Severity 5) | CLIP Accuracy (%) |
|---|---|
| Glass Blur | 43.0 |
| Gaussian Noise | 43.0 |
| Speckle Noise | 50.0 |
| Shot Noise | 48.0 |
| Elastic Transform | 56.0 |
| Pixelate | 58.0 |
| Contrast | 62.0 |
| Impulse Noise | 61.0 |
| JPEG Compression | 60.0 |
| Gaussian Blur | 69.0 |
| Motion Blur | 70.0 |
ImageNet-C Sample Image (Ostrich with severity 5 fog)
| ImageNet-C Corruption Type (Severity 5) | CLIP Accuracy (%) |
|---|---|
| Impulse Noise | 8.9 |
| Glass Blur | 10.1 |
| Gaussian Noise | 14.1 |
| Shot Noise | 16.6 |
| Fog | 25.1 |
| Pixelate | 28.4 |
| Elastic Transform | 29.7 |
| Zoom Blur | 31.8 |
| Snow | 35.9 |
| Motion Blur | 36.5 |
| JPEG Compression | 38.2 |
| Frost | 39.3 |
SFT and RL training pipelines
For the CLIP reward model we use `openai/clip-vit-base-patch16`, because its smaller patch size improved reward sensitivity on camouflaged scenes.
ImageDDPO extends DDPO to the img2img setting by optimizing a diffusion policy that edits an input image instead of generating purely from noise.
MDP and Objective Changes
Implementation Changes from DDPO
`Img2ImgDDPOStableDiffusionPipeline` wraps the base DDPO pipeline to (i) VAE-encode the input image, (ii) forward-diffuse to \( x_{sT} \), and (iii) denoise from \( t = sT \) to \( 0 \) while recording latents and log-probs.
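A schematic version of those three stages, written against diffusers-style components, might look as follows. `ddim_step_with_logprob` is a hypothetical helper returning the next latent and the Gaussian log-probability of having sampled it; the real pipeline records per-step log-probs in a similar way.

```python
import torch

@torch.no_grad()
def img2img_rollout(pipe, image, prompt_embeds, strength=0.3, num_steps=50):
    """Schematic rollout for the img2img setting. `pipe` is a Stable Diffusion
    pipeline; preprocessing details are omitted."""
    # (i) VAE-encode the input image (expected in [-1, 1]) into latents.
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # (ii) Forward-diffuse to the start timestep t = sT set by `strength`.
    pipe.scheduler.set_timesteps(num_steps)
    timesteps = pipe.scheduler.timesteps[-int(strength * num_steps):]
    latents = pipe.scheduler.add_noise(latents, torch.randn_like(latents),
                                       timesteps[:1])

    # (iii) Denoise from sT to 0, recording latents and log-probs for PPO.
    all_latents, all_log_probs = [latents], []
    for t in timesteps:
        eps = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents, log_prob = ddim_step_with_logprob(pipe.scheduler, eps, t, latents)
        all_latents.append(latents)
        all_log_probs.append(log_prob)
    return all_latents, all_log_probs
```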
Our reinforcement learning setup extends the ImageDDPOTrainer with three task-specific modifications. First, we define a reward equal to the logit difference between the edited and original images for the ground-truth class, which was more stable than probability- or margin-based alternatives. Second, we integrate distributed training via `accelerate`, allowing rollouts, reward evaluation, and PPO updates to run efficiently. Third, we add a validation hook that periodically reconstructs diffusion intermediates to monitor whether training meaningfully changes the denoising trajectory; Figure 1 shows an example intermediate.
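As a concrete illustration of the first modification, a logit-difference reward can be computed from a frozen CLIP model roughly as sketched below; `text_embeds` holds pre-computed, normalized class text embeddings, and the exact tensor plumbing in our trainer differs.

```python
import torch

@torch.no_grad()
def logit_diff_reward(clip_model, text_embeds, edited, original, labels):
    """Reward = ground-truth-class CLIP logit on the edited image minus the
    same logit on the original. `edited`/`original`: CLIP-preprocessed pixel
    tensors; `labels`: (B,) class indices."""
    def class_logits(pixels):
        feats = clip_model.get_image_features(pixel_values=pixels)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return clip_model.logit_scale.exp() * feats @ text_embeds.T

    rows = torch.arange(labels.shape[0])
    return class_logits(edited)[rows, labels] - class_logits(original)[rows, labels]
```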
Our SFT method trains a LoRA adapter on the UNet component of Stable Diffusion by directly minimizing CLIP classification loss. Unlike RL, SFT maintains the full computational graph through the diffusion pipeline and backpropagates through the VAE decoder and frozen CLIP classifier to obtain exact gradients with respect to the classification objective. To keep the sampling path differentiable, we use deterministic DDIM sampling (\( \eta = 0 \)).
The SFT forward pass is:
$$ \begin{gathered} \text{Image} \to \text{VAE Encode} \to \text{Add Noise} \to \text{Denoise w/ UNet (LoRA)} \to \text{VAE Decode} \to \text{CLIP} \to \text{Cross-Entropy Loss} \end{gathered} $$
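A minimal sketch of this differentiable chain is shown below, assuming a diffusers pipeline with a DDIM scheduler, pre-computed empty-prompt embeddings, and a hypothetical `encode_for_clip` transform that differentiably resizes and normalizes decoded images to CLIP's input space; it is an illustration, not our exact training loop.

```python
import torch
import torch.nn.functional as F

def sft_step(pipe, clip_model, text_embeds, prompt_embeds, image, labels,
             noise_strength=0.4, num_steps=20):
    """One SFT step (schematic): the whole chain stays differentiable, so the
    cross-entropy gradient reaches the UNet LoRA weights."""
    # Image -> VAE Encode
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # Add Noise up to t = sT, with s = noise_strength
    pipe.scheduler.set_timesteps(num_steps)
    timesteps = pipe.scheduler.timesteps[-int(noise_strength * num_steps):]
    latents = pipe.scheduler.add_noise(latents, torch.randn_like(latents),
                                       timesteps[:1])

    # Denoise w/ UNet (LoRA), deterministic DDIM (eta = 0)
    for t in timesteps:
        eps = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = pipe.scheduler.step(eps, t, latents).prev_sample

    # VAE Decode -> CLIP -> Cross-Entropy Loss
    decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    feats = clip_model.get_image_features(pixel_values=encode_for_clip(decoded))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    logits = clip_model.logit_scale.exp() * feats @ text_embeds.T
    return F.cross_entropy(logits, labels)
```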
We freeze all components except the LoRA parameters applied to the UNet's attention layers. Specifically, LoRA adapters are applied to `to_q`, `to_k`, `to_v`, and `to_out.0` in both self-attention and cross-attention blocks. We use a LoRA rank of 4 and alpha of 4, making less than 1% of the UNet parameters trainable.
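With peft, this adapter configuration can be expressed roughly as follows, assuming `unet` is a loaded diffusers `UNet2DConditionModel`:

```python
from peft import LoraConfig

# Rank-4 LoRA on the UNet attention projections, matching the setup above;
# only these adapters end up trainable (< 1% of UNet parameters).
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

unet.requires_grad_(False)     # freeze the base UNet
unet.add_adapter(lora_config)  # LoRA parameters are added as trainable
```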
We run an image-to-image (I2I) denoising loop by encoding the input image into latents, adding noise up to a start timestep determined by `noise_strength`, then denoising deterministically for the remaining steps using DDIM (\( \eta = 0 \)). This preserves gradient flow across all denoising operations while allowing targeted edits that remain anchored to the input.
After decoding, we classify the edited image with CLIP and minimize cross-entropy between predicted logits and the ground-truth class label. We pre-compute CLIP text embeddings for each class using prompts of the form "An image of {class_name}" and compute logits using CLIP's temperature-scaled cosine similarity. Critically, while gradients flow backward through the frozen CLIP classifier during training, CLIP itself receives no parameter updates — only the UNet LoRA weights are modified.
In all our experiments we fine-tune only UNet LoRA weights and keep the VAE and text encoder frozen. We explore both empty prompts and task prompts for diffusion conditioning. For SFT, we use deterministic DDIM sampling to preserve differentiability. For RL, we sample rollouts and perform PPO-style updates using the logit-difference reward.
We experiment with three prompt strategies for conditioning the diffusion model during training and evaluation: (1) ORACLE, where the prompt contains the ground-truth label (e.g., "A clear photo of {class}"), which serves as an upper bound but often induces reward-hacking behavior by generating images that no longer resemble the input; (2) custom fixed task prompts, which provide a consistent instruction across all samples; and (3) empty prompts, which remove text guidance entirely and test whether the model can learn edits solely from the image-conditioned objective.
For the custom fixed prompts, we use: "De-camouflage the animal", "Increase edge sharpness and contrast", and "Enhance the visibility of the camouflaged animal".
The following table summarizes the hyperparameter ranges that produced the most stable training behavior in our runs. These are reported to make the experimental search space explicit and reproducible.
| Hyperparameter | RL (ImageDDPO) | SFT |
|---|---|---|
| Learning rate | 1e-4 | 1e-4 |
| Noise strength | 0.2 - 0.4 | 0.4 |
| Guidance scale | 0 (null prompts) or 7.0 (task prompts) | 1.0 |
| Batch size | Effective 256 per gradient update (using gradient accumulation) | Effective 16 per gradient update |
| Epochs / training length | Up to 500 (reward continued climbing) | 30 epochs |
| Train dataset size | 256 - 1024 (larger improves validation, slower reward climb) | Full train split used for reported runs |
We evaluate on a withheld COD10K test set of 4000 images. We report top-1, top-3, and top-5 accuracy under a fixed CLIP evaluation protocol and compare four conditions: (1) a zero-shot CLIP baseline on the original images, (2) BASE SD + CLIP, (3) RL-trained SD + CLIP, and (4) SFT-trained SD + CLIP.
All conditions use the same CLIP model (`openai/clip-vit-base-patch16`) and preprocessing pipeline to ensure fair comparison. Images are resized to 224×224 using bicubic interpolation and normalized using CLIP's standard statistics (mean: [0.4815, 0.4578, 0.4082], std: [0.2686, 0.2613, 0.2758]). We pre-compute normalized text embeddings for all classes using prompts of the form "An image of {class_name}" and compute classification logits using CLIP's learned temperature-scaled cosine similarity.
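This evaluation protocol reduces to a few lines with transformers; `class_names` is assumed to hold the COD10K class labels, and image preprocessing is elided.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Pre-compute normalized text embeddings once for all classes.
prompts = [f"An image of {name}" for name in class_names]
text_inputs = processor(text=prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

@torch.no_grad()
def classify(pixel_values):
    """Logits are CLIP's temperature-scaled cosine similarities."""
    feats = model.get_image_features(pixel_values=pixel_values)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return model.logit_scale.exp() * feats @ text_embeds.T
```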
Table 1 shows classification accuracy results on the COD10K test set across four experimental conditions: CLIP baseline, BASE SD + CLIP, RL-trained SD + CLIP, and SFT-trained SD + CLIP.
| Method | Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| CLIP Baseline | 44.57% | 68.51% | 77.00% |
| BASE SD + CLIP (Null Prompt) | 11.40% | 26.60% | 35.98% |
| RL-Trained SD + CLIP (Null Prompt) | 14.51% | 32.18% | 41.07% |
| SFT-Trained SD + CLIP (Null Prompt) | 25.32% | 45.95% | 56.07% |
RL and SFT improve classification accuracy over an untuned SD + CLIP pipeline. However, a zero-shot CLIP baseline still significantly exceeds the performance of all the trained pairs.
This model was trained for 30 epochs using the supervised fine-tuning method described above, with empty prompts for the diffusion model, which, as we discuss below, perform no differently from custom prompts.
'Good' generations: before/after image pairs for a grasshopper, a frog, and a spider, each still classified as its true class after editing.
'Bad' generations: before/after image pairs where the edit changes the predicted class (turtle → owl, snake → frog, spider → moth).
The first row shows a series of "good" generations from the diffusion model, while the second row shows "bad" generations that lead to misclassifications from the same model. The "good" generations show some promising behavior, such as a blurring of the background which enhances the camouflaged object. This is apparent in the frog and spider image pairs, where the background is noticeably blurred relative to the camouflaged object. The grasshopper's background is initially blurred, but the diffusion model additionally blends and simplifies the background colors.
In each "bad" generation pair, the diffusion model seems to misidentify the camouflaged object as a different class and denoise toward that class's data distribution. This explains why the SFT-trained model can perform worse than CLIP applied directly to the unedited camouflage images. The effect is most noticeable in the snake-frog pair, where the snake disappears entirely and something resembling a frog appears. The spider-moth pair shows another interesting behavior, where a "moth" object is seemingly generated independently of the input. It is worth noting that, even to the human eye, the camouflage images behind the "bad" generations are considerably harder, although they all belong to the same COD10K dataset, which does not annotate difficulty.
Figure 8: Validation loss and validation accuracy comparison across prompt strategies.
We also compared performance across different prompt strategies. Figure 8 shows validation loss and accuracy curves for models trained with custom prompts ("De-camouflage the animal", "Increase edge sharpness and contrast", "Enhance the visibility of the camouflaged animal") versus the no-prompt baseline. Despite the semantic differences in prompt wording, all custom prompts performed similarly to each other and showed no meaningful improvement over the no-prompt condition.
ORACLE prompt generations: before/after image pairs for a cat, a frog, and a mantis, each classified correctly after editing.
As expected, the ORACLE prompt performed best, reaching close to 100% accuracy on the test set. However, the diffusion model tends to cheat, generating images that bear little resemblance to the original in order to minimize the classification loss for the desired class.
The RL results summarized in Table 1 use the LoRA checkpoint from the green run in Figure 10. Although this run shows steady reward improvement, the reward signal measures relative logit increase after editing rather than absolute classification correctness. As a consequence, rising reward does not translate into improved top-k accuracy when evaluated against the CLIP baseline.
We conducted extensive hyperparameter sweeps and over one hundred training runs across different dataset sizes, prompt regimes, and noise strengths. Despite these efforts, RL did not yield accuracy improvements on the camouflage task. This holds even for long training horizons, where reward continues climbing but the edited images do not meaningfully aid CLIP in identifying the camouflaged animal.
Overall, the RL pipeline successfully optimized its own reward function but did not produce diffusion edits that improved classification performance on COD10K, indicating a misalignment between the reward and the downstream objective.
Figure 10: Reward progress across training configurations.
On the COD10K task, our SFT and RL-trained generative models achieve 25.32% and 14.51% top-1 accuracy, respectively. While this falls short of the 44.57% CLIP baseline, both represent a significant improvement over the untrained BASE SD + CLIP combination, with SFT more than doubling its 11.40% accuracy.
The fundamental challenge stems from tasking a generative diffusion model with an implicitly discriminative objective. Diffusion models are not pretrained for semantic label preservation; rather, they are optimized to produce clean, high-fidelity images from noise. This structural circularity renders the task fundamentally ill-posed: the model implicitly requires the semantic label to correctly denoise the image, yet the purpose of the denoising is to recover that very label. When presented with the label ambiguity inherent in camouflaged images, the model performs stochastic sampling over plausible choices. Consequently, if the model initially misidentifies the subject, it denoises toward the wrong class distribution, effectively hallucinating a high-quality but semantically incorrect object. This failure mode is evidenced by the 'Bad' generations in Figure 7.
From an optimization perspective, this approach also struggles with credit assignment. Propagating gradients or rewards through 10-20 denoising steps is computationally expensive and obscures the causal link between specific denoising actions and the final classification. Furthermore, the reward signal itself is sparse and exhibits high variance, bounded by the relatively weak performance (< 50%) of the CLIP baseline on this specific dataset. This combination makes the optimization landscape difficult to traverse and highly sample-inefficient.
We propose several extensions to address these limitations and improve performance:

- A broader hyperparameter sweep over `noise_strength`, `learning_rate`, and `guidance_scale` (in order of expected impact).
- A curriculum that starts with a high `guidance_scale` using the ground-truth label (ORACLE), then gradually anneals the guidance to transition the model from prompt-dependence to autonomous detection.

Our investigation yields a negative result: employing a diffusion prior for semantic denoising fails to improve zero-shot classification performance on camouflaged data and, in preliminary experiments, on corrupted data.
While Stable Diffusion adapted with both supervised fine-tuning and ImageDDPO improved over a base, unadapted generator, the inherent stochasticity of the diffusion process works against the final goal of improving classification: it hallucinates new features rather than enhancing existing ones. When weighed against the substantial computational overhead of iterative denoising, this tradeoff between accuracy gains and introduced corruption renders the approach ineffective compared to standard discriminative baselines (or more direct methods like SFT or linear probing on the classifier).
We present these findings to discourage similar architectural choices in the future: for tasks requiring precise semantic preservation under ambiguity, current generative priors are ill-suited to serve as pre-processing filters.