We present Dress-ED, the first large-scale benchmark unifying VTON, VTOFF, and text-guided garment editing within a single, semantically consistent, language-driven framework. No prior dataset jointly supports instruction-driven VTON and VTOFF; Dress-ED defines this task from scratch, providing 146,460 verified quadruplets across 7 edit types and 3 garment categories. Quality is ensured by a 4-stage automated pipeline that achieves 95.6% human-model agreement. Alongside the benchmark, we introduce Dress-EM, a unified multimodal diffusion model that handles both VTON and VTOFF and outperforms all competing baselines across every task and metric.
Figure 1. Dress-ED unifies VTON, VTOFF, and text-guided garment editing within a single semantically consistent framework.
Built on top of Dress Code, the pipeline converts static person-garment pairs into richly annotated editing quadruplets — at scale, without any manual labeling.
A multimodal LLM jointly processes the in-shop garment and the on-model person image to produce a structured JSON of garment attributes. Rule-based templates synthesise paired editing instructions covering all 7 edit types.
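The template stage can be sketched as follows. This is a minimal illustration assuming a hypothetical attribute schema and template set; the paper's actual JSON fields and the full set of 7 edit-type templates are not specified here.

```python
import json

# Hypothetical templates keyed by edit type; the benchmark's real templates
# and attribute names may differ.
TEMPLATES = {
    "color": "Change the {category}'s color from {color} to {target}.",
    "material": "Make the {category} out of {target} instead of {material}.",
}

def synthesize_instruction(attrs: dict, edit_type: str, target: str) -> str:
    """Fill a rule-based template with attributes extracted by the MLLM."""
    return TEMPLATES[edit_type].format(target=target, **attrs)

# Structured JSON as the MLLM stage would emit it (toy example).
attrs = json.loads('{"category": "dress", "color": "red", "material": "cotton"}')
print(synthesize_instruction(attrs, "color", "navy blue"))
# -> Change the dress's color from red to navy blue.
```

Because the attributes come from a structured JSON rather than free text, every generated instruction is guaranteed to reference a property the garment actually has.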
Qwen3-VL
Given each (garment, instruction) pair, a state-of-the-art diffusion editor synthesises the edited in-shop garment. This stage produces ~300k raw candidates across all categories before quality filtering.
FLUX.2 Klein
The edited garment is virtually tried on the original person image using a high-fidelity VTON model. A specialised segmentation model ensures hands and body parts are correctly preserved during the try-on.
FitDiT
Each sample is scored by GPT-5 for instruction adherence, content preservation, and realism. A fine-tuned InternVL-3.5 distils these judgments and filters out samples scoring below 80, retaining ~146k verified quadruplets.
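The filtering step reduces to a threshold check. A minimal sketch, assuming per-axis scores in [0, 100] and a mean aggregate; the field names and the aggregation rule are illustrative, as the page does not specify how the three axes combine.

```python
THRESHOLD = 80  # cutoff stated on the page

def keep(sample: dict) -> bool:
    """Retain a quadruplet only if its aggregate verifier score passes the cutoff."""
    axes = ("adherence", "preservation", "realism")  # assumed field names
    return sum(sample[a] for a in axes) / len(axes) >= THRESHOLD

candidates = [
    {"id": 1, "adherence": 92, "preservation": 88, "realism": 85},
    {"id": 2, "adherence": 60, "preservation": 95, "realism": 70},
]
verified = [s for s in candidates if keep(s)]
print([s["id"] for s in verified])  # -> [1]
```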
GPT-5 + InternVL-3.5
In a blind user study (221 samples), human annotators and the fine-tuned InternVL verifier agreed on 95.6% of cases. 77.7% were jointly rated "good" and 17.9% jointly rated "bad" — confirming the filter correctly identifies low-quality generations without discarding valid data.
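The agreement figures above decompose into jointly-good and jointly-bad fractions. A small sketch of that computation, on toy labels rather than the study's 221 real annotations:

```python
def agreement_stats(human: list, verifier: list) -> dict:
    """Fraction of samples where human and verifier labels coincide,
    split into jointly 'good' and jointly 'bad' cases."""
    pairs = list(zip(human, verifier))
    n = len(pairs)
    both_good = sum(h == v == "good" for h, v in pairs) / n
    both_bad = sum(h == v == "bad" for h, v in pairs) / n
    return {"agreement": both_good + both_bad,
            "both_good": both_good,
            "both_bad": both_bad}

# Toy labels for illustration only.
human    = ["good", "good", "bad", "good", "bad"]
verifier = ["good", "bad",  "bad", "good", "good"]
print(agreement_stats(human, verifier))
```

On the real study, both_good = 0.777 and both_bad = 0.179, summing to the reported 0.956 agreement.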
Dress-EM is an architecture built for edited virtual try-on and try-off, trained on our annotated benchmark. It comprises an MLLM that extracts the multimodal editing prompt, and a lightweight trainable connector that projects the multimodal embeddings into the DiT input space.
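An illustrative stand-in for the connector is a small MLP mapping MLLM token embeddings into the DiT token space. All dimensions and the two-layer shape below are assumptions; the actual module's architecture is not described on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
MLLM_DIM, DIT_DIM, HIDDEN = 4096, 3072, 2048  # assumed dimensions

# Randomly initialised weights stand in for the trained connector.
W1 = rng.standard_normal((MLLM_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, DIT_DIM)) * 0.02

def connect(mllm_tokens: np.ndarray) -> np.ndarray:
    """Project a (seq_len, MLLM_DIM) embedding sequence into DiT input space."""
    h = np.maximum(mllm_tokens @ W1, 0.0)  # ReLU for brevity; real connectors often use GELU/SiLU
    return h @ W2

tokens = rng.standard_normal((77, MLLM_DIM))  # e.g. a 77-token prompt sequence
print(connect(tokens).shape)  # -> (77, 3072)
```

Keeping the MLLM and DiT frozen and training only such a connector is a common, parameter-efficient way to bridge two pretrained models.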
Dress-EM consistently leads across all 7 edit categories — from coarse color and material changes to fine-grained structural modifications — demonstrating that a single unified model can flexibly handle the full diversity of the Dress-ED benchmark without task-specific tuning.
@article{dressED2026,
title={Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off},
author={Sanguigni, Fulvio and Lobba, Davide and Ren, Bin and Cornia, Marcella and Sebe, Nicu and Cucchiara, Rita},
journal={arXiv preprint arXiv:2603.22607},
year={2026}
}