Abstract
Transfer attacks optimize perturbations on a surrogate model and deploy them against a black-box target. Iterative optimization attacks in this paradigm require multistep gradient updates for each input, which limits their efficiency and scalability; generative attacks alleviate this by producing adversarial examples in a single forward pass at test time. However, current generative attacks still focus on optimizing surrogate losses (e.g., feature divergence) and overlook the generator’s internal dynamics, leaving underexplored how the generator’s internal representations shape transferable perturbations.
To address this, we enforce semantic consistency by aligning the generator’s early intermediate features to an exponential moving average (EMA) teacher, stabilizing object-aligned representations and improving black-box transfer without inference-time overhead. To ground this mechanism, we quantify semantic stability as the standard deviation of foreground IoU between cluster-derived activation masks and foreground masks across generator blocks, and observe reduced semantic drift under our method.
For more reliable evaluation, we also introduce the Accidental Correction Rate (ACR) to separate inadvertent corrections from intended misclassifications, addressing blind spots in the traditional Attack Success Rate (ASR), Fooling Rate (FR), and Accuracy metrics. Across architectures, domains, and tasks, our approach integrates seamlessly into existing generative attacks, yielding consistent improvements in black-box transfer while maintaining test-time efficiency.
Key Contributions
- Generator-internal evidence for perturbation semantics. To investigate perturbation semantics within the generator, we partition the generator into early/mid/late blocks and quantify object-aligned semantics per block. Our analysis reveals that methods with lower variability in foreground IoU across the intermediate blocks exhibit higher adversarial transfer. (§2.2)
- Generator-level semantic consistency guidance. By enforcing training-only semantic consistency on the generator’s early intermediates, we improve adversarial transfer while keeping the adversarial objective on the surrogate unchanged. The guidance can be seamlessly integrated into existing generative attacks without altering the test pipeline and at no additional inference cost (see the sketch after this list). (§3)
- Comprehensive evaluation with an added reliability measure. We conduct a comprehensive transferability evaluation spanning classification (CLS) across architectures and domains, as well as dense prediction tasks (SS, OD). We also complement the conventional Accuracy, ASR, and FR metrics with a novel ACR metric that assesses attack reliability by separating inadvertent corrections from intended misclassifications. (§4.2)
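A minimal sketch of how the training-only semantic consistency guidance from the second contribution could be wired in, assuming a PyTorch generator whose early intermediate features are exposed during the forward pass; the function names, cosine-style alignment, and EMA decay below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Exponential moving average of the student generator's weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def consistency_loss(student_feats, teacher_feats):
    # Align the generator's early intermediate features to the EMA teacher's.
    # Cosine-style alignment on L2-normalized features is an illustrative choice;
    # teacher features are assumed to be computed under torch.no_grad().
    s = F.normalize(student_feats.flatten(1), dim=1)
    t = F.normalize(teacher_feats.flatten(1), dim=1)
    return (1.0 - (s * t).sum(dim=1)).mean()

# Schematic training step: the surrogate adversarial objective stays unchanged,
# and the consistency term is applied only during training, so the test-time
# pipeline and inference cost are identical to the baseline attack.
#   loss = adversarial_loss + lam * consistency_loss(early_feats, teacher_early_feats)
#   loss.backward(); optimizer.step(); ema_update(teacher, generator)
```

Because the teacher is used only to compute an extra loss term during training, the deployed generator is unchanged at test time.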
Motivation: A closer look into the perturbation generator
Question 1: At what stage of perturbation synthesis do semantic cues deteriorate?
Question 2: Which generator blocks most influence transferability?
Our observations on the semantic variability within the perturbation generator:
(a) Generator intermediate feature maps for each block partition.
(b) Predicted masks from intermediate feature clusters on ImageNet-S from the baseline (BIA).
(c) Quantified variability in foreground IoU.
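As a rough illustration of how the block-wise foreground IoU variability in panel (c) might be measured, the sketch below clusters a block's activations into a binary foreground mask and reports the standard deviation of its IoU with the foreground mask across blocks; the two-cluster k-means and the foreground-selection heuristic are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def activation_foreground_mask(feat):
    """Cluster per-pixel activations of one block (C, H, W) into two groups
    and take the higher-energy cluster as the predicted foreground."""
    c, h, w = feat.shape
    pixels = feat.reshape(c, -1).T                        # (H*W, C)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(pixels)
    norms = np.linalg.norm(pixels, axis=1)
    fg_label = int(norms[labels == 1].mean() > norms[labels == 0].mean())
    return labels.reshape(h, w) == fg_label

def foreground_iou(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)

def semantic_variability(block_feats, gt_masks):
    """Std of foreground IoU across generator blocks (lower = more stable).
    gt_masks holds the foreground mask resized to each block's resolution."""
    ious = [foreground_iou(activation_foreground_mask(f), m)
            for f, m in zip(block_feats, gt_masks)]
    return float(np.std(ious))
```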
Our method’s distinction. Comparison of transfer-based generative adversarial attacks, organized by the stage of the training pipeline each method targets (left→right) and whether ground-truth labels are required.
(Table stages, left→right: Perturbation Generation → Evaluation Protocol)
Visualization of generator intermediate block-level differences with the baseline (BIA): raw feature differences on the bottom, thresholded on top (normalized for illustration purposes only). With our generator-internal semantic consistency mechanism, the adversarial perturbation is progressively guided to focus on the salient object regions first and then gradually disperse to the surrounding background regions.
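A rough sketch of how block-level difference maps like these could be produced is given below; the channel-mean aggregation, min-max normalization, and 0.5 threshold are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def block_difference_maps(ours_feats, baseline_feats, thresh=0.5):
    """Per-block difference maps between two generators' intermediate features.
    Each feature tensor is (C, H, W); returns raw and thresholded maps in [0, 1]."""
    raw_maps, thr_maps = [], []
    for f_ours, f_base in zip(ours_feats, baseline_feats):
        diff = (f_ours - f_base).abs().mean(dim=0)                       # channel-mean difference (H, W)
        diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)    # min-max normalize for display
        raw_maps.append(diff)
        thr_maps.append((diff > thresh).float())
    return raw_maps, thr_maps
```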
Comparison of block-wise feature activations and class activation maps. Our SCGA-trained generator produces visibly stronger adversarial noise toward the later perturbation stages.
New evaluation metric: Accidental Correction Rate (ACR)
In the real world, adversarial perturbations induce different outcomes depending on whether the clean image is originally predicted correctly or incorrectly, and on whether the perturbed image causes the victim model to predict correctly or incorrectly.
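Under this framing, a toy sketch of how ACR could be computed next to ASR and accuracy from paired clean/adversarial predictions is shown below; normalizing both rates by the full test set (so the two kinds of flips are directly comparable) is an assumption consistent with the net-gain reading in the next paragraph.

```python
import numpy as np

def attack_metrics(clean_correct, adv_correct):
    """clean_correct / adv_correct: boolean arrays over the same test samples."""
    n = len(clean_correct)
    asr = np.sum(clean_correct & ~adv_correct) / n   # correct -> incorrect (intended)
    acr = np.sum(~clean_correct & adv_correct) / n   # incorrect -> correct (accidental)
    acc = np.sum(adv_correct) / n                    # accuracy under attack
    return {"ASR": asr, "ACR": acr, "Acc": acc}

# Toy example: two accidental corrections vs. one intended flip, so the
# victim's accuracy rises even though the attack "succeeded" on one sample.
clean = np.array([True, True, False, False, False, True, False])
adv   = np.array([True, False, True, True, False, True, False])
print(attack_metrics(clean, adv))   # ASR ~0.14, ACR ~0.29, Acc ~0.57
```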
Complementary role to ASR. At ε_test = 4, ACR (incorrect→correct) is 14% while ASR (correct→incorrect) is 8%, yielding a net gain of 6% in correct predictions. Because ASR accounts only for harmful flips, it overlooks this positive balance; ACR makes it explicit.
ACR is non-monotonic, unlike Acc/ASR/FR. It peaks at ε_test = 4 and then falls as stronger noise overwhelms corrective effects. This trend offers a very different view from the defender’s side: evaluation should consider not only how many errors an attack creates but also how many it inadvertently corrects.

Actionable insight. Defenders might tolerate or even harness low-budget perturbations that raise ACR, while attackers in safety-critical settings may need to penalize accidental corrections to avoid unintentionally improving model performance.
Our semantically consistent generative attack effectively exploits the generator intermediates to craft adversarial examples, enhancing transferability over the baselines (● → ▼) across domains (a) and models (b).
Main Results
Qualitative results. Our semantically consistent generative attack successfully guides the generator to focus perturbations on the semantically salient regions, effectively fooling the victim classifier.
①: (a) benign input image, (b) generated perturbation (normalized for visual purposes only), (c) unbounded adversarial image, and (d) bounded adversarial image across CUB-200-2011, Stanford Cars, FGVC Aircraft, and ImageNet-1K. The labels on top (green) and bottom (orange) denote the correct label and the post-attack prediction, respectively.
②: Our method induces Grad-CAM to focus on drastically different regions in our adversarial examples compared to both the benign image and the adversarial examples crafted by the baseline (BIA). Moreover, our approach noticeably spreads out and weakens the high-activation regions observed in the benign and baseline cases, enhancing the transferability of our adversarial perturbations.
③: Cross-task prediction results (SS on top, OD on bottom). Our approach further disrupts the victim models by triggering higher false-positive rates and wrong class-label predictions.
Semantic segmentation: our method does not merely blur or hide parts of the road; it causes the model to stop recognizing entire road areas as well as small objects such as pedestrians and cars. The baseline attack may only erase a few isolated pixels or blend edges, whereas ours suppresses whole stretches of road at once. In other words, our method uniformly removes both large surfaces and fine details, so the segmentation map ends up missing key parts of the scene that the baseline leaves untouched.
Object detection: our attack causes the model to stop predicting localized boxes around objects, removing every predicted region of interest (RoI), whereas the baseline often leaves boxes in place or only shifts them slightly.
Ablation Study
Performance with a different generator architecture
Performance by surrogate mid-layer and τ
Performance by hyperparameters
Robustness to defense mechanisms
Robustness to defenses and MLLMs
Extension to targeted generative attacks
Overall Performance
Citation
@inproceedings{jeong2026scga,
title={Improving Black-Box Generative Attacks via Generator Semantic Consistency},
author={Jeong, Jongoh and Yang, Hunmin and Jeong, Jaeseok and Yoon, Kuk-Jin},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=ibXhUapwcz}
}