Image Gallery

Introduction

I paid for OpenAI's DALL-E 3 to generate some images and for OpenAI's GPT-4V to describe them.

For DALL-E 3, each submitted prompt is the result of 20 binary choices in the following template:

(An oil painting/A photorealistic image) of a (happy/sad) (old/young) (man/woman) wearing a (blue/red) (hat/pair of shoes), (sitting/lying) on (grass/a couch) and (reading a book/drinking wine). The (foreground/background) includes a (dog running/bird flying) near (roses/sunflowers) and (silverware/jewelry). The (top/bottom) (left/right) quarter includes (a Christian cross/an Islamic star and crescent) symbol. The scene is set in a (warm/cold) (morning/evening) in the (seventeenth/twentieth) century and has (low/high)-saturation colors.
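The template above can be instantiated programmatically from a vector of 20 bits. The following sketch is hypothetical (the helper name `build_prompt` and the exact slot ordering are my assumptions, read off the template as written), but it shows how each bit selects one of the two options:

```python
# Hypothetical sketch: the 20 binary choices of the template, in order.
CHOICES = [
    ("An oil painting", "A photorealistic image"),
    ("happy", "sad"),
    ("old", "young"),
    ("man", "woman"),
    ("blue", "red"),
    ("hat", "pair of shoes"),
    ("sitting", "lying"),
    ("grass", "a couch"),
    ("reading a book", "drinking wine"),
    ("foreground", "background"),
    ("dog running", "bird flying"),
    ("roses", "sunflowers"),
    ("silverware", "jewelry"),
    ("top", "bottom"),
    ("left", "right"),
    ("a Christian cross", "an Islamic star and crescent"),
    ("warm", "cold"),
    ("morning", "evening"),
    ("seventeenth", "twentieth"),
    ("low", "high"),
]

TEMPLATE = (
    "{0} of a {1} {2} {3} wearing a {4} {5}, {6} on {7} and {8}. "
    "The {9} includes a {10} near {11} and {12}. "
    "The {13} {14} quarter includes {15} symbol. "
    "The scene is set in a {16} {17} in the {18} century "
    "and has {19}-saturation colors."
)

def build_prompt(bits):
    """Instantiate the template from 20 bits (0 = first option, 1 = second)."""
    assert len(bits) == len(CHOICES)
    picks = [pair[b] for pair, b in zip(CHOICES, bits)]
    return TEMPLATE.format(*picks)

# Example: the all-zeros vector picks the first option everywhere.
print(build_prompt([0] * 20))
```

Each distinct bit vector thus maps to exactly one prompt, which is what makes the combination count below well defined.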

There are 2^20 = 1,048,576 such combinations. Each generation costs $0.08 and takes a few seconds to produce, so generating the full set would cost roughly $84,000. Given time and budget constraints, I generated only 100 images. However, sampling combinations uniformly at random would very likely produce pairs of images that differ in around 10 of the 20 choices, and almost never in 0, 1, 2 or 18, 19, 20 of them. Thus, to cover the full spectrum of differences, the following process was used:
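The concentration around 10 differing choices can be checked numerically: each of the 20 choices differs independently with probability 1/2 between two uniformly random combinations, so the number of differences follows a Binomial(20, 1/2) distribution. A small simulation sketch (my own illustration, not part of the generation pipeline):

```python
import random
from collections import Counter

random.seed(0)
N_BITS = 20
N_PAIRS = 100_000

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Draw pairs of random 20-bit combinations and tally their distances.
counts = Counter(
    hamming(
        [random.getrandbits(1) for _ in range(N_BITS)],
        [random.getrandbits(1) for _ in range(N_BITS)],
    )
    for _ in range(N_PAIRS)
)

for d in sorted(counts):
    print(f"{d:2d} differing choices: {counts[d] / N_PAIRS:.4f}")
```

Distances 0 through 2 (and symmetrically 18 through 20) together account for about 0.02% of random pairs, which is why uniform sampling alone cannot cover the extremes.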

Most images are not perfect. Some are quite strange or display impossible features, and some do not fully respect the prompt. As a reminder, DALL-E 3 revises the submitted prompt before generating; I provide this revised prompt as well, since it is the one actually used to generate the image. It would be worthwhile to manually check, and record somewhere, which elements of each prompt were respected and which were not.

Nevertheless, this collection can still serve as a basis for developing and studying image descriptor models.

To describe the images, I paid for OpenAI's GPT-4V model. I give it a role, ask it a question, and record its answer. Here are the roles assigned so far: