Fill-Up

Abstract

Modern text-to-image synthesis models have achieved an exceptional level of photorealism, generating high-quality images from arbitrary text descriptions. In light of this impressive synthesis ability, several studies have shown promising results in exploiting generated data for image recognition. However, directly supplementing data-hungry real-world situations (e.g., few-shot or long-tailed scenarios) with existing approaches results in marginal performance gains, as they fail to thoroughly reflect the distribution of the real data. Through extensive experiments, this paper proposes a new image synthesis pipeline for long-tailed situations using Textual Inversion. The study demonstrates that images generated from textual-inverted text tokens effectively align with the real domain, significantly enhancing the recognition ability of a standard ResNet50 backbone. We also show that real-world data imbalance scenarios can be successfully mitigated by filling up the imbalanced data with synthetic images. In conjunction with techniques in the area of long-tailed recognition, our method achieves state-of-the-art results on standard long-tailed benchmarks when trained from scratch.

Exploring Diverse Generation Methods for Fill-Up

Recent large-scale generative models offer various ways to generate synthetic images. To find the most effective way to generate synthetic samples, we try prompt-to-image methods (Single template, CLIP templates, T5, Flan T5-XXL), image-to-image methods (Image Variation, Stable Diffusion Reimagine), and transmodal methods (a captioning model and Textual Inversion). We use Textual Inversion as our main generation method because it shows the best performance. More details can be found in the paper.
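
As a rough illustration of what "filling up" a long-tailed dataset involves, the sketch below computes how many synthetic images each class would need in order to match the largest class. The function name `fill_up_counts`, the choice of the largest class as the balancing target, and the toy labels are assumptions made for illustration; they are not taken from the paper's pipeline.

```python
from collections import Counter


def fill_up_counts(labels, target=None):
    """Return {class: number of synthetic images needed to reach `target`}.

    `target` defaults to the size of the largest class (an assumption here,
    not necessarily the balancing rule used in the paper).
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic images.
    return {cls: max(target - n, 0) for cls, n in counts.items()}


# Toy long-tailed label list: one head class and two tail classes.
labels = ["dog"] * 5 + ["cat"] * 2 + ["axolotl"] * 1
needed = fill_up_counts(labels)
print(needed)
```

In this toy case the head class needs nothing, while the tail classes receive enough synthetic samples to reach five images each. Any of the generation methods listed above could in principle supply those samples.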

Benchmark Tables on Standard Long-tailed Datasets

Experimental results for ImageNet-LT and iNaturalist2018-LT.
^* denotes models trained for more epochs (over 200).

Experimental results for Places-LT (left) and different synthetic data generation methods on ImageNet-LT (right).
^† denotes models trained from scratch using a ResNet50 backbone.

Per Class Accuracies of Fill-Up

Compared to other long-tailed techniques, Fill-Up achieves high accuracy on all classes. It excels at learning minority-class distributions by optimizing text tokens per class, surpassing traditional synthetic sample generation methods.
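
For readers who want to reproduce this kind of per-class analysis, here is a minimal sketch of computing per-class accuracy and averaging it over frequency groups. The helper names and the many/medium/few thresholds (more than 100, 20 to 100, and fewer than 20 training images) are assumptions: the thresholds follow a convention often used for long-tailed benchmarks and are not stated on this page.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {class: accuracy} computed over the given predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def group_accuracy(class_acc, train_counts, many=100, few=20):
    """Average per-class accuracy within many/medium/few-shot groups.

    The thresholds are assumed defaults, not values from the source.
    """
    groups = {"many": [], "medium": [], "few": []}
    for c, acc in class_acc.items():
        n = train_counts[c]
        if n > many:
            groups["many"].append(acc)
        elif n < few:
            groups["few"].append(acc)
        else:
            groups["medium"].append(acc)
    # Skip empty groups to avoid dividing by zero.
    return {k: sum(v) / len(v) for k, v in groups.items() if v}


# Toy example: three classes with very different training-set sizes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]
train_counts = {0: 500, 1: 50, 2: 5}

class_acc = per_class_accuracy(y_true, y_pred)
grouped = group_accuracy(class_acc, train_counts)
print(class_acc)
print(grouped)
```

Reporting the grouped averages alongside overall accuracy makes it easier to see whether gains come from the tail classes or only from the head.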

Impact of Initial Word Choice for Textual Inversion

Regardless of the initial word choice, we observe that the Textual Inversion optimization converges to similar results. This allows our method to effectively capture the image distribution without any prior knowledge from the text domain (i.e., without any class-related text information).

Generated Images

Synthetic samples from ImageNet generated through different methods

Synthetic samples from iNaturalist2018 and Places-LT

In general, the Textual Inversion method generates images with higher diversity and fidelity than other methods, resulting in improved alignment with the real data distribution. More images and details for selected class letters can be found in the paper.

BibTeX

@article{shin2023fill,
  title   = {{Fill-Up: Balancing Long-Tailed Data with Generative Models}},
  author  = {Shin, Joonghyuk and Kang, Minguk and Park, Jaesik},
  journal = {arXiv preprint arXiv:2306.07200},
  year    = {2023}
}
