arXiv Preprint · 2026

Dress-ED: Instruction-Guided Editing
for Virtual Try-On and Try-Off

Fulvio Sanguigni1,3,*  ·  Davide Lobba2,3,*  ·  Bin Ren2,3  ·  Marcella Cornia1  ·  Nicu Sebe2  ·  Rita Cucchiara1
1University of Modena  ·  2University of Trento  ·  3University of Pisa
*Equal contribution
146k verified quadruplets · 49k distinct garment IDs · 7 edit types · 3 garment categories · 95.6% filter accuracy

A benchmark to unify virtual try-on, virtual try-off
and instruction-based editing

Abstract

We present Dress-ED, the first large-scale benchmark to unify virtual try-on (VTON), virtual try-off (VTOFF), and text-guided garment editing within a single, semantically consistent framework driven by natural language. No prior dataset jointly supports instruction-driven VTON and VTOFF; Dress-ED defines this task from scratch, providing 146,460 verified quadruplets spanning 7 edit types and 3 garment categories. Quality is ensured by a four-stage automated pipeline whose verifier reaches 95.6% agreement with human annotators. Alongside the benchmark, we introduce Dress-EM, a unified multimodal diffusion baseline that handles both VTON and VTOFF and outperforms all competing baselines across every task and metric.

Dress-ED teaser figure

Figure 1. Dress-ED unifies VTON, VTOFF, and text-guided garment editing within a single semantically consistent framework.

Four automated stages,
one scalable pipeline

Built on top of Dress Code, the pipeline converts static person–garment pairs into richly annotated editing quadruplets at scale, without any manual labeling.
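Concretely, each pipeline output can be thought of as one quadruplet record. A minimal sketch of such a record follows; the field names and file layout are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EditQuadruplet:
    """One Dress-ED sample (hypothetical field names, for illustration)."""
    garment_image: str         # original in-shop garment
    person_image: str          # original on-model photo
    edited_garment_image: str  # garment after applying the instruction
    edited_person_image: str   # person wearing the edited garment
    instruction: str           # natural-language edit
    edit_type: str             # one of the 7 edit types
    category: str              # one of the 3 garment categories

sample = EditQuadruplet(
    "garments/001.jpg", "people/001.jpg",
    "garments/001_edit.jpg", "people/001_edit.jpg",
    "Change the dress color to red", "change_color", "dresses",
)
```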

Stage I · Attribute Extraction & Instruction Generation

A multimodal LLM jointly processes the in-shop garment and the on-model person image to produce a structured JSON of garment attributes. Rule-based templates synthesise paired editing instructions covering all 7 edit types.

Qwen3-VL
Stage 1 figure
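The template step above can be sketched as follows; the attribute JSON and templates are assumptions for illustration, not the pipeline's actual prompts or schema.

```python
import json

# Hypothetical rule-based templates, one per edit type (two shown).
TEMPLATES = {
    "change_color": "Change the {garment} color from {color} to {new_color}",
    "change_material": "Make the {garment} out of {new_material} instead of {material}",
}

def make_instruction(attributes: dict, edit_type: str, **edits) -> str:
    """Fill a template with extracted attributes plus the target edit values."""
    slots = {**attributes, **edits}
    return TEMPLATES[edit_type].format(**slots)

# Structured JSON as produced by the attribute-extraction MLLM (illustrative).
attrs = json.loads('{"garment": "dress", "color": "blue", "material": "cotton"}')
print(make_instruction(attrs, "change_color", new_color="red"))
# → Change the dress color from blue to red
```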
Stage II · Edited Garment Generation

Given each (garment, instruction) pair, a state-of-the-art diffusion editor synthesises the edited in-shop garment. This stage produces ~300k raw candidates across all categories before quality filtering.

FLUX.2 Klein
Stage 2 figure
Stage III · Edited Try-On Generation

The edited garment is virtually tried on the original person image using a high-fidelity VTON model. A specialised segmentation model ensures hands and body parts are correctly preserved during the try-on.

FitDiT
Stage 3 figure
Stage IV · LLM-Guided Quality Verification

Each sample is scored by GPT-5 for instruction adherence, content preservation, and realism. A fine-tuned InternVL-3.5 distils these judgments and filters out samples scoring below 80, retaining ~146k verified quadruplets.

GPT-5 + InternVL-3.5
Stage 4 figure
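The filtering rule can be sketched as below. The score names and record shape are assumptions, as is applying the 80-point threshold per criterion rather than to an aggregate score.

```python
THRESHOLD = 80  # samples scoring below this are discarded

def passes_filter(scores: dict) -> bool:
    """Keep a sample only if every verifier criterion meets the threshold."""
    criteria = ("instruction_adherence", "content_preservation", "realism")
    return all(scores[c] >= THRESHOLD for c in criteria)

# Two illustrative candidates: the second fails on content preservation.
candidates = [
    {"instruction_adherence": 92, "content_preservation": 88, "realism": 90},
    {"instruction_adherence": 95, "content_preservation": 60, "realism": 85},
]
kept = [s for s in candidates if passes_filter(s)]
print(len(kept))  # → 1
```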

Verification accuracy: 95.6%

In a blind user study of 221 samples, human annotators and the fine-tuned InternVL verifier agreed in 95.6% of cases: 77.7% of samples were jointly rated "good" and 17.9% jointly rated "bad", confirming that the filter identifies low-quality generations without discarding valid data.

95.6% human-model agreement

146,460 samples across three garment categories

Add Detail: 27% of dataset
Change Pattern: 27% of dataset
Change Color: 20% of dataset
Modify Structure: 11% of dataset
Change Material: 10% of dataset
Remove Element: 4% of dataset
Fine-Grained: 1% of dataset

Browse Dress-ED samples

Interactive gallery: each sample shows the original garment, the original person, the edited garment, and the edited person, filterable by edit type and garment category.

Dress-EM: a unified multimodal diffusion baseline

Dress-EM architecture diagram

Dress-EM is an architecture built for edited virtual try-on and edited virtual try-off, trained on our annotated benchmark. It combines an MLLM that extracts the multimodal editing prompt with a trainable lightweight connector that projects the multimodal embeddings into the DiT input space.
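A minimal sketch of the connector idea, projecting MLLM embeddings into the DiT token space. The dimensions (4096 → 3072) and the two-layer MLP design are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Lightweight trainable bridge from MLLM embeddings to DiT inputs (sketch)."""
    def __init__(self, mllm_dim: int = 4096, dit_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, dit_dim),
            nn.GELU(),
            nn.Linear(dit_dim, dit_dim),
        )

    def forward(self, mllm_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, mllm_dim) -> (batch, seq_len, dit_dim)
        return self.proj(mllm_embeddings)

connector = Connector()
tokens = torch.randn(1, 77, 4096)  # stand-in for MLLM prompt embeddings
out = connector(tokens)
print(out.shape)  # torch.Size([1, 77, 3072])
```

Keeping the connector small and trainable while the MLLM stays frozen is a common design for grafting language-model conditioning onto a diffusion backbone.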

Dress-EM outperforms all baselines
across every edit type

Per-edit-type DINO-I — Dress-EM vs best baseline (Edited VTON)

Semantic consistency across all 7 edit tasks · higher is better

Dress-EM consistently leads across all 7 edit categories, from coarse color and material changes to fine-grained structural modifications, demonstrating that a single unified model can flexibly handle the full diversity of the Dress-ED benchmark without task-specific tuning.

BibTeX

@article{dressED2026,
  title={Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off},
  author={Sanguigni, Fulvio and Lobba, Davide and Ren, Bin and Cornia, Marcella and Sebe, Nicu and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2603.22607},
  year={2026}
}