Introduction
I paid for OpenAI's Dall-E 3 to generate some images and for OpenAI's GPT-4V
to describe these images.
For Dall-E 3, each prompt submitted is the result
of 20 choices in the following template:
(An oil painting/A photorealistic image) of a (happy/sad) (old/young) (man/woman)
wearing a (blue/red) (hat/pair of shoes), (sitting/lying) on (grass/a couch) and
(reading a book/drinking wine).
The (foreground/background) includes a (dog running/bird flying) near (roses/sunflowers)
and (silverware/jewelry). The (top/bottom) (left/right) quarter includes
(a Christian cross/an Islamic star and crescent) symbol. The scene is set in a
(warm/cold) (morning/evening) in the (seventeenth/twentieth) century and has
(low/high)-saturation colors.
There are 2^20 = 1,048,576 such combinations. Each generation costs $0,08 and takes
a few seconds to produce. Given time and budget constraints, I generated only 100 images.
However, generating combinations randomly will very likely produce images that differ by
around 10 modalities, and very rarely by 0, 1, 2 or 18, 19, 20. Thus, to cover
the full spectrum, the following process was used:
- Each image is accompanied by its exact opposite. Hence, pairs of images (even ID, odd ID)
such as (Image 0, Image 1), (Image 2, Image 3), etc..., have completely opposite prompts.
- There are 96 unique prompts. The prompt of Image 0 (resp. 1, 42, 43) is the same
as the prompt of Image 2 (resp. 3, 44, 45).
- Images 0 and 2 are generated from a same randomly sampled combination of modalities.
Then, Image 4 differs from Image 2 by only one modality.
Then, Image 6 differs from Image 4 by only (another) one modality (and thus
differs from Image 2 by two modalities), etc., up to Image 40.
Note: images are generated from prompts only, they are not modified following
the previous image as input. This is not yet possible with Dall-E 3.
Images 1, 3, 5,..., 41 are their opposites, and of course two consecutive such images
differ by one modality as well (except Images 1 and 3 that have the same prompt).
Images 42 to 83 are generated following the same process.
- The remaining 16 images (84-99) (note: we start counting at 0) are pairs of
opposite images generated from randomly sampled combination of modalities.
- By construction, each choice per modality has been selected exactly 50 times.
So, "blue" and "red" are both in 50 prompts, "man" and "woman" as well, etc.
Most images are not perfect. Some are even very strange, or display impossible features.
Some may not respect the prompt completely. As a reminder, Dall-E 3 revises the submitted
prompt, which I also provide. This is the real prompt used to generate the images.
It might be a good idea to check manually, and write that down somewhere,
what has been respected and what has not.
Nevertheless, this can still serve as a basis to develop and study image descriptor models.
To describe the images, I paid for OpenAI's GPT-4V model. I give it a role, I ask it a
question, and receive its answer. So far, here are the roles attributed:
- Art Describer: You are a professional Art Describer.
Your responsibilities are as follows. You must provide detailed and accurate
descriptions of paintings based on visual observations. Focus on objectively
capturing the visual elements, including colors, shapes, subjects, and composition.
Be as complete as possible. Avoid offering personal interpretations, emotions,
or attempts to recognize the historical context, author, title, or date of the paintings.
Clearly communicate visual details in a precise and neutral manner. As a reminder,
the role of an Art Describer should emphasize the objective and factual aspects of
describing visual content while refraining from providing personal opinions,
emotional responses, or historical context.