We present Dress-ED, the first large-scale benchmark unifying VTON, VTOFF, and text-guided garment editing within a single, semantically consistent, language-driven framework. No prior dataset jointly supports instruction-driven VTON and VTOFF; Dress-ED defines this task from scratch, providing 146,460 verified quadruplets across 7 edit types and 3 garment categories. Quality is ensured by a 4-stage automated pipeline that achieves 95.6% human-model agreement. Alongside the benchmark, we introduce Dress-EM, a unified multimodal diffusion model that handles both VTON and VTOFF and outperforms all competing baselines across every task and metric.
Figure 1. Dress-ED unifies VTON, VTOFF, and text-guided garment editing within a single semantically consistent framework.
Built on top of Dress Code, the pipeline converts static person-garment pairs into richly annotated editing quadruplets — at scale, without any manual labeling.
A multimodal LLM jointly processes the in-shop garment and the on-model person image to produce a structured JSON of garment attributes. Rule-based templates synthesise paired editing instructions covering all 7 edit types.
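The template stage can be sketched as follows. This is a minimal illustration assuming a hypothetical attribute schema and template set; the paper's actual JSON fields and the full set of 7 edit-type templates are not specified here.

```python
import json

# Hypothetical templates keyed by edit type; the benchmark's real templates
# and attribute names may differ.
TEMPLATES = {
    "color": "Change the {category}'s color from {color} to {target}.",
    "material": "Make the {category} out of {target} instead of {material}.",
}

def synthesize_instruction(attrs: dict, edit_type: str, target: str) -> str:
    """Fill a rule-based template with attributes extracted by the MLLM."""
    return TEMPLATES[edit_type].format(target=target, **attrs)

# Structured JSON as the MLLM stage would emit it (toy example).
attrs = json.loads('{"category": "dress", "color": "red", "material": "cotton"}')
print(synthesize_instruction(attrs, "color", "navy blue"))
# -> Change the dress's color from red to navy blue.
```

Because the attributes come from a structured JSON rather than free text, every generated instruction is guaranteed to reference a property the garment actually has.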
Qwen3-VL
Given each (garment, instruction) pair, a state-of-the-art diffusion editor synthesises the edited in-shop garment. This stage produces ~300k raw candidates across all categories before quality filtering.
FLUX.2 Klein
The edited garment is virtually tried on the original person image using a high-fidelity VTON model. A specialised segmentation model ensures hands and body parts are correctly preserved during the try-on.
FitDiT
Each sample is scored by GPT-5 for instruction adherence, content preservation, and realism. A fine-tuned InternVL-3.5 distils these judgments and filters out samples scoring below 80, retaining ~146k verified quadruplets.
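The filtering step reduces to a threshold check. A minimal sketch, assuming per-axis scores in [0, 100] and a mean aggregate; the field names and the aggregation rule are illustrative, as the page does not specify how the three axes combine.

```python
THRESHOLD = 80  # cutoff stated on the page

def keep(sample: dict) -> bool:
    """Retain a quadruplet only if its aggregate verifier score passes the cutoff."""
    axes = ("adherence", "preservation", "realism")  # assumed field names
    return sum(sample[a] for a in axes) / len(axes) >= THRESHOLD

candidates = [
    {"id": 1, "adherence": 92, "preservation": 88, "realism": 85},
    {"id": 2, "adherence": 60, "preservation": 95, "realism": 70},
]
verified = [s for s in candidates if keep(s)]
print([s["id"] for s in verified])  # -> [1]
```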
GPT-5 + InternVL-3.5
In a blind user study (221 samples), human annotators and the fine-tuned InternVL verifier agreed on 95.6% of cases. 77.7% were jointly rated "good" and 17.9% jointly rated "bad" — confirming the filter correctly identifies low-quality generations without discarding valid data.
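The agreement figures above decompose into jointly-good and jointly-bad fractions. A small sketch of that computation, on toy labels rather than the study's 221 real annotations:

```python
def agreement_stats(human: list, verifier: list) -> dict:
    """Fraction of samples where human and verifier labels coincide,
    split into jointly 'good' and jointly 'bad' cases."""
    pairs = list(zip(human, verifier))
    n = len(pairs)
    both_good = sum(h == v == "good" for h, v in pairs) / n
    both_bad = sum(h == v == "bad" for h, v in pairs) / n
    return {"agreement": both_good + both_bad,
            "both_good": both_good,
            "both_bad": both_bad}

# Toy labels for illustration only.
human    = ["good", "good", "bad", "good", "bad"]
verifier = ["good", "bad",  "bad", "good", "good"]
print(agreement_stats(human, verifier))
```

On the real study, both_good = 0.777 and both_bad = 0.179, summing to the reported 0.956 agreement.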
Dress-EM is an architecture built for edited virtual try-on and try-off, trained on our annotated benchmark. It comprises an MLLM that extracts the multimodal editing prompt, and a lightweight trainable connector that projects the multimodal embeddings into the DiT input space.
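An illustrative stand-in for the connector is a small MLP mapping MLLM token embeddings into the DiT token space. All dimensions and the two-layer shape below are assumptions; the actual module's architecture is not described on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
MLLM_DIM, DIT_DIM, HIDDEN = 4096, 3072, 2048  # assumed dimensions

# Randomly initialised weights stand in for the trained connector.
W1 = rng.standard_normal((MLLM_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, DIT_DIM)) * 0.02

def connect(mllm_tokens: np.ndarray) -> np.ndarray:
    """Project a (seq_len, MLLM_DIM) embedding sequence into DiT input space."""
    h = np.maximum(mllm_tokens @ W1, 0.0)  # ReLU for brevity; real connectors often use GELU/SiLU
    return h @ W2

tokens = rng.standard_normal((77, MLLM_DIM))  # e.g. a 77-token prompt sequence
print(connect(tokens).shape)  # -> (77, 3072)
```

Keeping the MLLM and DiT frozen and training only such a connector is a common, parameter-efficient way to bridge two pretrained models.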
Dress-EM consistently leads across all 7 edit categories — from coarse color and material changes to fine-grained structural modifications — demonstrating that a single unified model can flexibly handle the full diversity of the Dress-ED benchmark without task-specific tuning.
@article{dressED2026,
title={Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off},
author={Sanguigni, Fulvio and Lobba, Davide and Ren, Bin and Cornia, Marcella and Sebe, Nicu and Cucchiara, Rita},
journal={arXiv preprint arXiv:2603.22607},
year={2026}
}