Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training

Tags
Generative model
Affiliation
Peking University
Article type
Research
Date
2023/08/10
Journal
arXiv
Published Year
2022
Keywords
Diffusion
Computer vision

Jargons

Scene graph
A scene graph is a general data structure commonly used by vector-based graphics editing applications and modern computer games; it arranges the logical and often spatial representation of a graphical scene as a collection of nodes in a graph or tree structure (see the minimal sketch after this list).
Scene layout
Spatial layout of objects
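A common way to encode a scene graph for vision tasks (and in Visual Genome-style annotations) is a list of object nodes plus (subject, predicate, object) triples for the relations. The minimal Python sketch below illustrates that convention only; the class name and fields are placeholders, not the paper's exact data format.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph container: objects are nodes, relations are
    directed edges stored as (subject_idx, predicate, object_idx) triples."""
    objects: list                                   # e.g. ["sheep", "grass", "sky"]
    triples: list = field(default_factory=list)     # e.g. [(0, "standing on", 1)]

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        self.triples.append((subj, predicate, obj))

# Example: "a sheep standing on grass, with sky above the grass"
sg = SceneGraph(objects=["sheep", "grass", "sky"])
sg.add_relation(0, "standing on", 1)
sg.add_relation(2, "above", 1)
```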

Task

Image generation from scene graph

Challenges

Previous studies use image-like representations of scene graphs, typically in the form of scene layouts
- Scene layouts are crafted manually and are not specifically designed to facilitate the alignment between images and graphs
Relations such as behind, inside, and in front of all correspond to similar spatial configurations in scene layouts, so they are hard to distinguish from the layout alone

Goal

Image generation from scene graphs without using scene layouts
Learning intermediate representations that explicitly maximize the alignment between scene graphs and images

Methods

Dataset
Visual Genome (VG) dataset: 108,077 scene graph & image pairs
COCO-Stuff dataset: 40,000 training images and 5,000 validation images with pixel-wise annotations, bounding boxes, and segmentation masks
Method
Input: scene graph s & image x pairs
Output: synthetic image
Detailed method
Stage 1: Masked contrastive pre-training, which trains the scene graph (SG) encoder
Stage 2: Diffusion-based scene graph to image generation, conditioned on embeddings from the Stage 1 SG encoder (see the sketches following this list)
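A minimal sketch of Stage 1, assuming a CLIP-style symmetric InfoNCE objective between embeddings of masked scene graphs and their paired images; `sg_encoder`, `img_encoder`, and the `mask_nodes_and_edges` helper are hypothetical placeholders, and the masking ratio and temperature are illustrative rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(sg_encoder, img_encoder, graphs, images,
                            mask_ratio=0.3, temperature=0.07):
    """Symmetric InfoNCE between scene-graph and image embeddings.
    A fraction of graph nodes/edges is masked before encoding."""
    masked_graphs = graphs.mask_nodes_and_edges(ratio=mask_ratio)  # hypothetical helper

    g = F.normalize(sg_encoder(masked_graphs), dim=-1)   # (B, D) graph embeddings
    v = F.normalize(img_encoder(images), dim=-1)         # (B, D) image embeddings

    logits = g @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)   # matched pairs lie on the diagonal

    # Pull matched graph-image pairs together, push mismatched pairs apart.
    loss_g2i = F.cross_entropy(logits, targets)
    loss_i2g = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_g2i + loss_i2g)
```

For Stage 2, a standard DDPM-style noise-prediction step conditioned on the frozen SG embedding would look roughly as follows; the conditional `denoiser` (e.g. a U-Net) and the exact conditioning mechanism are assumptions, not the paper's reported architecture.

```python
def diffusion_training_step(denoiser, sg_encoder, images, graphs,
                            alphas_cumprod, num_timesteps=1000):
    """One DDPM training step: add noise at a random timestep and predict it,
    conditioning the denoiser on the (frozen) Stage-1 scene-graph embedding."""
    cond = sg_encoder(graphs).detach()                        # frozen SG encoder from Stage 1
    noise = torch.randn_like(images)
    t = torch.randint(0, num_timesteps, (images.size(0),), device=images.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)               # per-sample cumulative alpha
    noisy = a_bar.sqrt() * images + (1.0 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, cond)                           # hypothetical conditional U-Net
    return F.mse_loss(pred, noise)
```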

Results

Comparison with other methods
Semantic image manipulation