Gaussian Glasses: Do Diffusion Priors Help Classification Under Corruption and Camouflage?

Hector Astrom | Kevin Zhu | Luc Gaitskell

Final project for 6.7960, MIT
Example of overfitting in an RL-trained model
Our ImageDDPO implementation can be installed with `pip install imageddpo`.
In this work, we adopt conditional denoising diffusion probabilistic models (DDPMs), which define a distribution \( p(x_0 | c) \) over data samples \( x_0 \) given an associated context \( c \). These models operate by introducing a forward noising process \( q(x_t | x_{t-1}) \), a Markov chain that incrementally corrupts the data with noise across many timesteps. The learning task is to approximate the reverse of this corruption process.
To achieve this, a neural network \( \mu_\theta(x_t, c, t) \) is trained to predict the posterior mean of the forward process at each timestep. Training proceeds by sampling a data pair \( (x_0, c) \), selecting a timestep \( t \) uniformly, and generating a noisy latent \( x_t \) using the forward diffusion kernel. The model minimizes the difference between its prediction and the true posterior mean \( \tilde{\mu}(x_0, t) \):
$$ L_{\text{DDPM}}(\theta) = \mathbb{E} \left[ \| \tilde{\mu}(x_0, t) - \mu_\theta(x_t, c, t) \|^2 \right]. $$

This objective essentially maximizes a variational lower bound on the data log-likelihood, which grounds the procedure in a principled generative modeling framework. Because the forward process is fixed and analytically tractable, the model's task is to estimate the reverse denoising dynamics.
At sampling time, generation begins from Gaussian noise \( x_T \sim \mathcal{N}(0, I) \). The model then iteratively applies the learned reverse transitions \( p_\theta(x_{t-1} | x_t, c) \), gradually removing noise and steering the sample toward the data distribution. Most widely used samplers implement this reverse update as an isotropic Gaussian with a timestep-dependent variance:
$$ p_\theta(x_{t-1} | x_t, c) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, c, t), \sigma_t^2 I). $$

By chaining these denoising steps together, diffusion models generate high-quality samples along an iterative noise-to-data trajectory that is fully guided by the learned model.
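To make the training and sampling recipes above concrete, here is a minimal PyTorch sketch of both, assuming a network `model(x_t, c, t)` that predicts the posterior mean and pre-computed schedule tensors `alphas`, `alphas_bar`, `betas`, and `sigmas`; all names and shapes are illustrative rather than our actual implementation.

```python
import torch

def ddpm_loss(model, x0, c, alphas, alphas_bar, betas):
    """One training step: noise x0 to x_t, then regress the true posterior mean."""
    B, T = x0.shape[0], betas.shape[0]
    t = torch.randint(1, T, (B,))                        # uniform timestep
    ab_t = alphas_bar[t].view(B, 1, 1, 1)
    ab_prev = alphas_bar[t - 1].view(B, 1, 1, 1)
    a_t, b_t = alphas[t].view(B, 1, 1, 1), betas[t].view(B, 1, 1, 1)

    eps = torch.randn_like(x0)
    x_t = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps     # forward kernel q(x_t | x_0)

    # True posterior mean of q(x_{t-1} | x_t, x_0), the regression target.
    mu_tilde = (ab_prev.sqrt() * b_t * x0 + a_t.sqrt() * (1 - ab_prev) * x_t) / (1 - ab_t)
    return ((mu_tilde - model(x_t, c, t)) ** 2).mean()   # L_DDPM

@torch.no_grad()
def ddpm_sample(model, c, shape, sigmas, T):
    """Reverse process: start from pure noise and chain learned denoising steps."""
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(1, T)):
        tt = torch.full((shape[0],), t)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = model(x, c, tt) + sigmas[t] * noise          # p_theta(x_{t-1} | x_t, c)
    return x
```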
We use the `DDPOTrainer` class to implement this learning algorithm. At a high level, it works as follows:
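The sketch below mirrors that high-level loop. The callables `rollout_fn`, `log_prob_fn`, and `reward_fn` are hypothetical stand-ins for pipeline internals, not the actual `DDPOTrainer` API, which additionally handles batching, gradient accumulation, and advantage bookkeeping.

```python
import torch

def ddpo_iteration(rollout_fn, log_prob_fn, reward_fn, optimizer,
                   num_inner_epochs=1, clip_range=1e-4):
    """One DDPO iteration (schematic). Each denoising step is an action in an
    MDP, so PPO's clipped surrogate is applied to per-step log-prob ratios."""
    # 1. Rollout: sample trajectories without gradients, recording the
    #    per-step log-probs under the current policy.
    with torch.no_grad():
        images, trajectories, old_log_probs = rollout_fn()   # log_probs: (B, S)
        # 2. Score the final images and normalize rewards into advantages.
        rewards = reward_fn(images)                          # (B,)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 3. PPO update: recompute log-probs with gradients and take clipped steps.
    for _ in range(num_inner_epochs):
        new_log_probs = log_prob_fn(trajectories)            # (B, S)
        ratio = (new_log_probs - old_log_probs).exp()
        unclipped = ratio * advantages[:, None]
        clipped = ratio.clamp(1 - clip_range, 1 + clip_range) * advantages[:, None]
        loss = -torch.min(unclipped, clipped).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```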
COD10K Sample Image (deer)
CIFAR-10-C Sample Image (Truck corrupted with Gaussian noise)
| CIFAR-10-C Corruption Type (Severity 5) | CLIP Accuracy (%) |
|---|---|
| Glass Blur | 43.0 |
| Gaussian Noise | 43.0 |
| Speckle Noise | 50.0 |
| Shot Noise | 48.0 |
| Elastic Transform | 56.0 |
| Pixelate | 58.0 |
| Contrast | 62.0 |
| Impulse Noise | 61.0 |
| JPEG Compression | 60.0 |
| Gaussian Blur | 69.0 |
| Motion Blur | 70.0 |
ImageNet-C Sample Image (Ostrich with severity 5 fog)
| ImageNet-C Corruption Type (Severity 5) | CLIP Accuracy (%) |
|---|---|
| Impulse Noise | 8.9 |
| Glass Blur | 10.1 |
| Gaussian Noise | 14.1 |
| Shot Noise | 16.6 |
| Fog | 25.1 |
| Pixelate | 28.4 |
| Elastic Transform | 29.7 |
| Zoom Blur | 31.8 |
| Snow | 35.9 |
| Motion Blur | 36.5 |
| JPEG Compression | 38.2 |
| Frost | 39.3 |
SFT and RL training pipelines
For the CLIP reward model we use `openai/clip-vit-base-patch16`, because its smaller patch size improved reward sensitivity on camouflaged scenes.
ImageDDPO extends DDPO to the img2img setting by optimizing a diffusion policy that edits an input image instead of generating purely from noise.
MDP and Objective Changes
Implementation Changes from DDPO
`Img2ImgDDPOStableDiffusionPipeline` wraps the base DDPO pipeline to (i) VAE-encode the input image, (ii) forward-diffuse to \( x_{sT} \), and (iii) denoise from \( t = sT \) to \( 0 \) while recording latents and log-probs.
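A schematic version of those three stages, written against diffusers-style components, might look as follows. `ddim_step_with_logprob` is a hypothetical helper returning the next latent and the Gaussian log-probability of having sampled it; the real pipeline records per-step log-probs in a similar way.

```python
import torch

@torch.no_grad()
def img2img_rollout(pipe, image, prompt_embeds, strength=0.3, num_steps=50):
    """Schematic rollout for the img2img setting. `pipe` is a Stable Diffusion
    pipeline; preprocessing details are omitted."""
    # (i) VAE-encode the input image (expected in [-1, 1]) into latents.
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # (ii) Forward-diffuse to the start timestep t = sT set by `strength`.
    pipe.scheduler.set_timesteps(num_steps)
    timesteps = pipe.scheduler.timesteps[-int(strength * num_steps):]
    latents = pipe.scheduler.add_noise(latents, torch.randn_like(latents),
                                       timesteps[:1])

    # (iii) Denoise from sT to 0, recording latents and log-probs for PPO.
    all_latents, all_log_probs = [latents], []
    for t in timesteps:
        eps = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents, log_prob = ddim_step_with_logprob(pipe.scheduler, eps, t, latents)
        all_latents.append(latents)
        all_log_probs.append(log_prob)
    return all_latents, all_log_probs
```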
Our reinforcement learning setup extends the ImageDDPOTrainer with three task-specific modifications. First, we define a reward equal to the logit difference between the edited and original images for the ground-truth class, which was more stable than probability- or margin-based alternatives. Second, we integrate distributed training via `accelerate`, allowing rollouts, reward evaluation, and PPO updates to run efficiently. Third, we add a validation hook that periodically reconstructs diffusion intermediates to monitor whether training meaningfully changes the denoising trajectory; Figure 1 shows an example intermediate.
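As a concrete illustration of the first modification, a logit-difference reward can be computed from a frozen CLIP model roughly as sketched below; `text_embeds` holds pre-computed, normalized class text embeddings, and the exact tensor plumbing in our trainer differs.

```python
import torch

@torch.no_grad()
def logit_diff_reward(clip_model, text_embeds, edited, original, labels):
    """Reward = ground-truth-class CLIP logit on the edited image minus the
    same logit on the original. `edited`/`original`: CLIP-preprocessed pixel
    tensors; `labels`: (B,) class indices."""
    def class_logits(pixels):
        feats = clip_model.get_image_features(pixel_values=pixels)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return clip_model.logit_scale.exp() * feats @ text_embeds.T

    rows = torch.arange(labels.shape[0])
    return class_logits(edited)[rows, labels] - class_logits(original)[rows, labels]
```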
Our SFT method trains a LoRA adapter on the UNet component of Stable Diffusion by directly minimizing CLIP classification loss. Unlike RL, SFT maintains the full computational graph through the diffusion pipeline and backpropagates through the VAE decoder and frozen CLIP classifier to obtain exact gradients with respect to the classification objective. To keep the sampling path differentiable, we use deterministic DDIM sampling (\( \eta = 0 \)).
The SFT forward pass is:
$$ \begin{gathered} \text{Image} \to \text{VAE Encode} \to \text{Add Noise} \to \text{Denoise w/ UNet (LoRA)} \to \text{VAE Decode} \to \text{CLIP} \to \text{Cross-Entropy Loss} \end{gathered} $$
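A minimal sketch of this differentiable chain is shown below, assuming a diffusers pipeline with a DDIM scheduler, pre-computed empty-prompt embeddings, and a hypothetical `encode_for_clip` transform that differentiably resizes and normalizes decoded images to CLIP's input space; it is an illustration, not our exact training loop.

```python
import torch
import torch.nn.functional as F

def sft_step(pipe, clip_model, text_embeds, prompt_embeds, image, labels,
             noise_strength=0.4, num_steps=20):
    """One SFT step (schematic): the whole chain stays differentiable, so the
    cross-entropy gradient reaches the UNet LoRA weights."""
    # Image -> VAE Encode
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # Add Noise up to t = sT, with s = noise_strength
    pipe.scheduler.set_timesteps(num_steps)
    timesteps = pipe.scheduler.timesteps[-int(noise_strength * num_steps):]
    latents = pipe.scheduler.add_noise(latents, torch.randn_like(latents),
                                       timesteps[:1])

    # Denoise w/ UNet (LoRA), deterministic DDIM (eta = 0)
    for t in timesteps:
        eps = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = pipe.scheduler.step(eps, t, latents).prev_sample

    # VAE Decode -> CLIP -> Cross-Entropy Loss
    decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    feats = clip_model.get_image_features(pixel_values=encode_for_clip(decoded))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    logits = clip_model.logit_scale.exp() * feats @ text_embeds.T
    return F.cross_entropy(logits, labels)
```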
We freeze all components except the LoRA parameters applied to the UNet's attention layers. Specifically, LoRA adapters are applied to `to_q`, `to_k`, `to_v`, and `to_out.0` in both self-attention and cross-attention blocks. We use a LoRA rank of 4 and alpha of 4, making less than 1% of the UNet parameters trainable.
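With peft, this adapter configuration can be expressed roughly as follows, assuming `unet` is a loaded diffusers `UNet2DConditionModel`:

```python
from peft import LoraConfig

# Rank-4 LoRA on the UNet attention projections, matching the setup above;
# only these adapters end up trainable (< 1% of UNet parameters).
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

unet.requires_grad_(False)     # freeze the base UNet
unet.add_adapter(lora_config)  # LoRA parameters are added as trainable
```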
We run an image-to-image (I2I) denoising loop by encoding the input image into latents, adding noise up to a start timestep determined by `noise_strength`, then denoising deterministically for the remaining steps using DDIM (\( \eta = 0 \)). This preserves gradient flow across all denoising operations while allowing targeted edits that remain anchored to the input.
After decoding, we classify the edited image with CLIP and minimize cross-entropy between predicted logits and the ground-truth class label. We pre-compute CLIP text embeddings for each class using prompts of the form "An image of {class_name}" and compute logits using CLIP's temperature-scaled cosine similarity. Critically, while gradients flow backward through the frozen CLIP classifier during training, CLIP itself receives no parameter updates — only the UNet LoRA weights are modified.
In all our experiments we fine-tune only UNet LoRA weights and keep the VAE and text encoder frozen. We explore both empty prompts and task prompts for diffusion conditioning. For SFT, we use deterministic DDIM sampling to preserve differentiability. For RL, we sample rollouts and perform PPO-style updates using the logit-difference reward.
We experiment with three prompt strategies for conditioning the diffusion model during training and evaluation: (1) ORACLE, where the prompt contains the ground-truth label (e.g., "A clear photo of {class}"), which serves as an upper bound but often induces reward-hacking behavior by generating images that no longer resemble the input; (2) custom fixed task prompts, which provide a consistent instruction across all samples; and (3) empty prompts, which remove text guidance entirely and test whether the model can learn edits solely from the image-conditioned objective.
For the custom fixed prompts, we use: "De-camouflage the animal", "Increase edge sharpness and contrast", and "Enhance the visibility of the camouflaged animal".
The following table summarizes the hyperparameter ranges that produced the most stable training behavior in our runs. These are reported to make the experimental search space explicit and reproducible.
| Hyperparameter | RL (ImageDDPO) | SFT |
|---|---|---|
| Learning rate | 1e-4 | 1e-4 |
| Noise strength | 0.2 - 0.4 | 0.4 |
| Guidance scale | 0 (null prompts) or 7.0 (task prompts) | 1.0 |
| Batch size | Effective 256 per gradient update (using gradient accumulation) | Effective 16 per gradient update |
| Epochs / training length | Up to 500 (reward continued climbing) | 30 epochs |
| Train dataset size | 256 - 1024 (larger improves validation, slower reward climb) | Full train split used for reported runs |
We evaluate on a withheld COD10K test set of 4000 images. We report top-1, top-3, and top-5 accuracy under a fixed CLIP evaluation protocol and compare four conditions: (1) a zero-shot CLIP baseline on the original images, (2) BASE SD + CLIP, (3) RL-trained SD + CLIP, and (4) SFT-trained SD + CLIP.
All conditions use the same CLIP model (`openai/clip-vit-base-patch16`) and preprocessing pipeline to ensure fair comparison. Images are resized to 224×224 using bicubic interpolation and normalized using CLIP's standard statistics (mean: [0.4815, 0.4578, 0.4082], std: [0.2686, 0.2613, 0.2758]). We pre-compute normalized text embeddings for all classes using prompts of the form "An image of {class_name}" and compute classification logits using CLIP's learned temperature-scaled cosine similarity.
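This evaluation protocol reduces to a few lines with transformers; `class_names` is assumed to hold the COD10K class labels, and image preprocessing is elided.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Pre-compute normalized text embeddings once for all classes.
prompts = [f"An image of {name}" for name in class_names]
text_inputs = processor(text=prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

@torch.no_grad()
def classify(pixel_values):
    """Logits are CLIP's temperature-scaled cosine similarities."""
    feats = model.get_image_features(pixel_values=pixel_values)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return model.logit_scale.exp() * feats @ text_embeds.T
```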
Table 1 shows classification accuracy results on the COD10K test set across four experimental conditions: CLIP baseline, BASE SD + CLIP, RL-trained SD + CLIP, and SFT-trained SD + CLIP.
| Method | Top-1 Accuracy | Top-3 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| CLIP Baseline | 44.57% | 68.51% | 77.00% |
| BASE SD + CLIP (Null Prompt) | 11.40% | 26.60% | 35.98% |
| RL-Trained SD + CLIP (Null Prompt) | 14.51% | 32.18% | 41.07% |
| SFT-Trained SD + CLIP (Null Prompt) | 25.32% | 45.95% | 56.07% |
RL and SFT improve classification accuracy over an untuned SD + CLIP pipeline. However, a zero-shot CLIP baseline still significantly exceeds the performance of all the trained pairs.
This model was trained for 30 epochs using the supervised fine-tuning method described above, with empty prompts for the diffusion model, which, as we discuss below, perform no differently from custom prompts.
'Good' generations: before/after image pairs for a grasshopper, a frog, and a spider, each still classified as its true class after editing.
'Bad' generations: before/after image pairs where the edit changes the predicted class (turtle → owl, snake → frog, spider → moth).
The first row shows a series of "good" generations from the diffusion model, while the second row shows "bad" generations that lead to misclassifications from the same model. The "good" generations show some promising behavior, such as a blurring of the background which enhances the camouflaged object. This is apparent in the frog and spider image pairs, where the background is noticeably blurred relative to the camouflaged object. The grasshopper's background is initially blurred, but the diffusion model additionally blends and simplifies the background colors.
In each "bad" generation pair, the diffusion model seems to misidentify the camouflaged object as a different class and denoise toward that class's data distribution. This explains why the SFT-trained model can perform worse than CLIP applied directly to the unedited camouflage images. The effect is most noticeable in the snake-frog pair, where the snake disappears entirely and something resembling a frog appears. The spider-moth pair shows another interesting behavior, where a "moth" object is seemingly generated independently of the input. It is worth noting that, even to the human eye, the camouflage images behind the "bad" generations are considerably harder, although they all belong to the same COD10K dataset, which does not annotate difficulty.
Figure 8: Validation loss and validation accuracy comparison across prompt strategies.
We also compared performance across different prompt strategies. Figure 8 shows validation loss and accuracy curves for models trained with custom prompts ("De-camouflage the animal", "Increase edge sharpness and contrast", "Enhance the visibility of the camouflaged animal") versus the no-prompt baseline. Despite the semantic differences in prompt wording, all custom prompts performed similarly to each other and showed no meaningful improvement over the no-prompt condition.
ORACLE prompt generations: before/after image pairs for a cat, a frog, and a mantis, each classified correctly after editing.
As expected, the ORACLE prompt performed best, reaching close to 100% accuracy on the test set. However, the diffusion model tends to cheat, generating images that bear little resemblance to the original in order to minimize the classification loss for the desired class.
The RL results summarized in Table 1 use the LoRA checkpoint from the green run in Figure 10. Although this run shows steady reward improvement, the reward signal measures relative logit increase after editing rather than absolute classification correctness. As a consequence, rising reward does not translate into improved top-k accuracy when evaluated against the CLIP baseline.
We conducted extensive hyperparameter sweeps and over one hundred training runs across different dataset sizes, prompt regimes, and noise strengths. Despite these efforts, RL did not yield accuracy improvements on the camouflage task. This holds even for long training horizons, where reward continues climbing but the edited images do not meaningfully aid CLIP in identifying the camouflaged animal.
Overall, the RL pipeline successfully optimized its own reward function but did not produce diffusion edits that improved classification performance on COD10K, indicating a misalignment between the reward and the downstream objective.
Figure 10: Reward progress across training configurations.
On the COD10K task, our SFT and RL-trained generative models achieve 25.32% and 14.51% top-1 accuracy, respectively. While this falls short of the 44.57% CLIP baseline, both represent a significant improvement over the untrained BASE SD + CLIP combination, with SFT more than doubling its 11.40% accuracy.
The fundamental challenge stems from tasking a generative diffusion model with an implicitly discriminative objective. Diffusion models are not pretrained for semantic label preservation; rather, they are optimized to produce clean, high-fidelity images from noise. This structural circularity renders the task fundamentally ill-posed: the model implicitly requires the semantic label to correctly denoise the image, yet the purpose of the denoising is to recover that very label. When presented with the label ambiguity inherent in camouflaged images, the model performs stochastic sampling over plausible choices. Consequently, if the model initially misidentifies the subject, it denoises toward the wrong class distribution, effectively hallucinating a high-quality but semantically incorrect object. This failure mode is evidenced by the 'Bad' generations in Figure 7.
From an optimization perspective, this approach also struggles with credit assignment. Propagating gradients or rewards through 10-20 denoising steps is computationally expensive and obscures the causal link between specific denoising actions and the final classification. Furthermore, the reward signal itself is sparse and exhibits high variance, bounded by the relatively weak performance (< 50%) of the CLIP baseline on this specific dataset. This combination makes the optimization landscape difficult to traverse and highly sample-inefficient.
We propose several extensions to address these limitations and improve performance:

- A broader hyperparameter sweep over `noise_strength`, `learning_rate`, and `guidance_scale` (in order of expected impact).
- A curriculum that starts with a high `guidance_scale` using the ground-truth label (ORACLE), then gradually anneals the guidance to transition the model from prompt-dependence to autonomous detection.

Our investigation yields a negative result: employing a diffusion prior for semantic denoising fails to improve zero-shot classification performance on camouflaged data and, in preliminary experiments, on corrupted data.
While Stable Diffusion adapted with both supervised fine-tuning and ImageDDPO improved over a base, unadapted generator, the inherent stochasticity of the diffusion process works against the final goal of improving classification: it hallucinates new features rather than enhancing existing ones. When weighed against the substantial computational overhead of iterative denoising, this tradeoff between accuracy gains and introduced corruption renders the approach ineffective compared to standard discriminative baselines (or more direct methods like SFT or linear probing on the classifier).
We present these findings to discourage similar architectural choices in the future: for tasks requiring precise semantic preservation under ambiguity, current generative priors are ill-suited to serve as pre-processing filters.