Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training

Tags
Generative model
Affiliation
Peking University
Article type
Research
Date
2023/08/10
Journal
arXiv
Published Year
2022
Keywords
Diffusion
Computer vision

Jargons

Scene graph
A scene graph is a general data structure commonly used by vector-based graphics editing applications and modern computer games; it arranges the logical and often spatial representation of a graphical scene as a collection of nodes in a graph or tree structure (see the minimal sketch after this list).
Scene layout
Spatial layout of objects
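A common way to encode a scene graph for vision tasks (and in Visual Genome-style annotations) is a list of object nodes plus (subject, predicate, object) triples for the relations. The minimal Python sketch below illustrates that convention only; the class name and fields are placeholders, not the paper's exact data format.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph container: objects are nodes, relations are
    directed edges stored as (subject_idx, predicate, object_idx) triples."""
    objects: list                                   # e.g. ["sheep", "grass", "sky"]
    triples: list = field(default_factory=list)     # e.g. [(0, "standing on", 1)]

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        self.triples.append((subj, predicate, obj))

# Example: "a sheep standing on grass, with sky above the grass"
sg = SceneGraph(objects=["sheep", "grass", "sky"])
sg.add_relation(0, "standing on", 1)
sg.add_relation(2, "above", 1)
```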

Task

Image generation from scene graph

Challenges

Previous studies use image-like representations of scene graphs, typically in the form of scene layouts
- Scene layouts are crafted manually and are not specifically designed to facilitate the alignment between images and graphs
Relations such as behind, inside, and in front of all correspond to similar spatial configurations in scene layouts, so they are hard to distinguish from the layout alone

Goal

Image generation from scene graphs without using scene layouts
Learning intermediate representations that explicitly maximize the alignment between scene graphs and images

Methods

Dataset
Visual Genome (VG) dataset: 108,077 scene graph & image pairs
COCO-Stuff dataset: 40,000 training images and 5,000 validation images with pixel-wise annotations, bounding boxes, and segmentation masks
Method
Input: scene graph s & image x pairs
Output: synthetic image
Detailed method
Stage 1: Masked contrastive pre-training, which trains the scene graph (SG) encoder
Stage 2: Diffusion-based scene graph to image generation, conditioned on embeddings from the Stage 1 SG encoder (see the sketches following this list)
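A minimal sketch of Stage 1, assuming a CLIP-style symmetric InfoNCE objective between embeddings of masked scene graphs and their paired images; `sg_encoder`, `img_encoder`, and the `mask_nodes_and_edges` helper are hypothetical placeholders, and the masking ratio and temperature are illustrative rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(sg_encoder, img_encoder, graphs, images,
                            mask_ratio=0.3, temperature=0.07):
    """Symmetric InfoNCE between scene-graph and image embeddings.
    A fraction of graph nodes/edges is masked before encoding."""
    masked_graphs = graphs.mask_nodes_and_edges(ratio=mask_ratio)  # hypothetical helper

    g = F.normalize(sg_encoder(masked_graphs), dim=-1)   # (B, D) graph embeddings
    v = F.normalize(img_encoder(images), dim=-1)         # (B, D) image embeddings

    logits = g @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)   # matched pairs lie on the diagonal

    # Pull matched graph-image pairs together, push mismatched pairs apart.
    loss_g2i = F.cross_entropy(logits, targets)
    loss_i2g = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_g2i + loss_i2g)
```

For Stage 2, a standard DDPM-style noise-prediction step conditioned on the frozen SG embedding would look roughly as follows; the conditional `denoiser` (e.g. a U-Net) and the exact conditioning mechanism are assumptions, not the paper's reported architecture.

```python
def diffusion_training_step(denoiser, sg_encoder, images, graphs,
                            alphas_cumprod, num_timesteps=1000):
    """One DDPM training step: add noise at a random timestep and predict it,
    conditioning the denoiser on the (frozen) Stage-1 scene-graph embedding."""
    cond = sg_encoder(graphs).detach()                        # frozen SG encoder from Stage 1
    noise = torch.randn_like(images)
    t = torch.randint(0, num_timesteps, (images.size(0),), device=images.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)               # per-sample cumulative alpha
    noisy = a_bar.sqrt() * images + (1.0 - a_bar).sqrt() * noise
    pred = denoiser(noisy, t, cond)                           # hypothetical conditional U-Net
    return F.mse_loss(pred, noise)
```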

Results

Comparison with other methods
Semantic image manipulation