RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Paper link

Link

Abstract

Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images, which accurately portray the text prompts, encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a “painter” for depicting a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on a project webpage: this https URL.
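The abstract's claim of "billions of diffusion paths" follows from a simple counting argument: if each MoE layer routes to one of several experts, the number of distinct input-to-output routes is the product of the expert counts across layers. A minimal sketch (the layer and expert counts below are illustrative assumptions, not RAPHAEL's actual configuration):

```python
# Stacking mixture-of-experts (MoE) layers: each layer routes its input to
# one of its experts, so the number of distinct routes from network input
# to output is the product of the per-layer expert counts.
# These sizes are made up for illustration, not RAPHAEL's real config.
experts_per_layer = [4] * 16  # e.g. 16 MoE layers with 4 experts each

paths = 1
for n in experts_per_layer:
    paths *= n

print(paths)  # 4**16 = 4294967296, i.e. billions of distinct paths
```

With only sixteen 4-expert layers the route count already exceeds four billion, which is why "tens" of stacked MoE layers suffice for the scale the abstract describes.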

Zero-shot Learning

First, what is zero-shot learning? Literally, it means that for some class (or classes) no training samples are provided at all: a transfer task with no labelled samples for those classes is called zero-shot learning.

Zero-shot learning aims to recognize classes that appear at test time but were never encountered during training. To do so, we learn a mapping X -> Y; if this mapping is good enough, it can handle unseen classes, which is why zero-shot learning can be regarded as a form of transfer learning.

An intuitive example: suppose the zebra is an unseen class. From a description and prior knowledge (it is shaped like a horse, striped like a tiger, and coloured like a panda), a person can infer what a zebra looks like and recognize one on first sight. Zero-shot learning tries to imitate this human reasoning process so that a computer gains the ability to recognize new things.
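The zebra reasoning above can be sketched in code: learn a mapping from image features to a shared attribute space using only the seen classes, then recognize the unseen class purely from its attribute description. Everything here (the three attributes, the synthetic features, the least-squares mapping) is an illustrative assumption, not any particular published zero-shot method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute descriptions (striped, horse-shaped, black-and-white).
# The zebra is never seen during training; only its description is known.
attributes = {
    "horse": np.array([0.0, 1.0, 0.0]),
    "tiger": np.array([1.0, 0.0, 0.0]),
    "panda": np.array([0.0, 0.0, 1.0]),
    "zebra": np.array([1.0, 1.0, 1.0]),  # unseen: striped + horse-shaped + b/w
}
seen = ["horse", "tiger", "panda"]

# Synthetic "image features": each image is its class attribute vector + noise.
X_train, Y_train = [], []
for name in seen:
    for _ in range(50):
        X_train.append(attributes[name] + 0.1 * rng.standard_normal(3))
        Y_train.append(attributes[name])
X_train, Y_train = np.array(X_train), np.array(Y_train)

# Learn the mapping X -> attribute space with ordinary least squares.
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# A new "zebra" image: predict its attributes, then pick the nearest class
# description, which may belong to a class never seen in training.
x_new = attributes["zebra"] + 0.1 * rng.standard_normal(3)
a_pred = x_new @ W
pred = min(attributes, key=lambda c: np.linalg.norm(attributes[c] - a_pred))
print(pred)  # the unseen class is recognized via its attribute description
```

The point of the sketch is that the mapping X -> Y (here, features to attributes) is what transfers: the zebra contributes no training data, only a description in the shared space.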

One-shot Learning

What is one-shot learning? One-shot learning provides only one, or a handful of, training samples for some class (or classes): a transfer task with a single labelled sample per class is called one-shot learning.

One-shot learning means we can still make predictions even when the training samples are very scarce, possibly just one. The key is again to learn a good mapping X -> Y and then apply it to other problems.
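A minimal sketch of predicting from only a few labelled samples, assuming a nearest-centroid rule on synthetic features (the class names, k, and dimensions are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# k-shot support set: only k = 3 labelled examples per class (few-shot regime).
k, dim = 3, 5
class_means = {"cat": rng.standard_normal(dim), "dog": rng.standard_normal(dim)}
support = {
    label: mean + 0.1 * rng.standard_normal((k, dim))
    for label, mean in class_means.items()
}

# Nearest-centroid rule: average the k support examples into one prototype
# per class, then assign a query to the class with the closest prototype.
prototypes = {label: xs.mean(axis=0) for label, xs in support.items()}

def classify(query):
    return min(prototypes, key=lambda c: np.linalg.norm(prototypes[c] - query))

query = class_means["dog"] + 0.1 * rng.standard_normal(dim)
print(classify(query))
```

With k = 1 this degenerates to one-shot nearest-neighbour classification; real few-shot methods learn the feature space itself, which is omitted here.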

One-shot learning is in fact similar to zero-shot learning, except that zero-shot learning provides no labelled samples for the unseen classes, while one-shot learning provides one or a few; with a few samples per class it is also called few-shot learning. A definition from the web:

To spell the definition out: the model has only a small labelled training set S containing N samples, where yi denotes the label of each sample. Since every sample in the test set has a correct class, we want the model, when given a new test sample x', to correctly predict its label y'.

Note: replacing the single sample for each class yi with k samples turns this into k-shot learning; "few-shot" generally means k is no larger than 20. Reference: "What is few-shot learning".

FID score

FID (Fréchet Inception Distance) measures how close the distribution of generated images is to the distribution of real images: features are extracted from both sets with an Inception network, each feature set is summarized by its mean and covariance, and the Fréchet distance between the two resulting Gaussians is reported. Lower is better; the zero-shot FID of 6.61 on COCO quoted in the abstract above is this metric.
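The FID score mentioned above can be computed from two sets of image features as follows. This is a minimal NumPy sketch of the standard formula; real implementations extract the features with a pretrained Inception-v3 network, which is omitted here and replaced with random vectors:

```python
import numpy as np

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance between two sets of feature vectors.

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # Tr((S1 S2)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of S1 @ S2, which are real and non-negative when
    # S1 and S2 are covariance matrices; clip tiny negatives from
    # floating-point noise before taking the square root.
    eigvals = np.linalg.eigvals(s1 @ s2).real
    tr_covmean = np.sqrt(np.maximum(eigvals, 0.0)).sum()
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1 + s2) - 2.0 * tr_covmean

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 8))
fake = rng.standard_normal((2000, 8)) + 0.5  # mean-shifted "generated" features

print(abs(fid(real, real)) < 1e-6)  # identical sets give (numerically) zero FID
print(fid(real, fake) > 1.0)        # a shifted distribution gives a large FID
```

The feature dimension here is 8 for readability; with Inception-v3 pool features it would be 2048, and the formula is unchanged.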

References

1. A first look at zero-shot/one-shot learning

2. [Stable Diffusion] What are FID, CLIP, and cfg-scales?


About 明柳梦少

Stick to your own principles; don't drift with the current.
