When faced with a bit of downtime, many of my friends will turn to the same party game. It’s based on the Surrealists’ Exquisite Corpse, and involves translating brief written descriptions into rapidly made drawings and back again. One group calls it Telephone Pictionary; another refers to it as Writey-Drawey. The internet tells me it is also called Eat Poop You Cat, a sequence of words surely inspired by one of the game’s results. While Exquisite Corpse can generate moments of absurdity, they rarely match the hilarity of Eat Poop You Cat. That hilarity is not always communicable (I’ve tried and failed to convey the gut-shaking silliness of one particularly memorable round in which the phrase panic at the disco gradually morphed into a well-drawn rager of deviant worms—some vomiting, some smoking, some fucking, one sipping from a novelty beer hat). But as a rule, bad translations tend to tickle, and shifting back and forth from the verbal to the visual amplifies these humorous juxtapositions and incongruities.
As recently as three years ago, it was rare to encounter oddities of text-to-image or image-to-text mistranslations in daily life, which made the outrageous outcomes of Eat Poop You Cat feel especially novel. But we have since entered a new era of image making, powered by tools that explicitly rely on these kinds of multimodal translation. With the aid of AI image generators like DALL-E 3, Stable Diffusion, and Midjourney, and the generative features integrated into Adobe’s Creative Cloud programs, you can now transform a sentence or phrase into a highly detailed image — or several — in mere seconds. Images, likewise, can be nearly instantly translated into descriptive text. Today, you can play Eat Poop You Cat alone in your room, cavorting with the algorithms.
Back in the summer of 2023, I tried it, using a browser-based version of Stable Diffusion — an open-source text-to-image generative AI (or GenAI) model — and another AI browser application called CLIP Interrogator, which translates any image into a text prompt. It took about three minutes to play two rounds of the game: anywhere from twenty seconds to two minutes to generate an image from a prompt, and another minute or so to process that image back into a text prompt. I kicked things off by typing “Eat Poop You Cat” (why not?) into a field that encouraged me to “Enter your prompt.” Then I clicked “Generate Image.”
The Stable Diffusion applet generates four images in response to any prompt; I cheated slightly by just choosing my favorite to proceed. From the center of the frame, a decently realistic tabby cat stared me down, green eyes glowing wide, mouth hanging open to display a salmon-pink tongue. The background was grungy gray without much detail; in the image’s lower third was some bubbly white text in emphatic all caps: EAT EAT POOOOP POOP YU NOU SOME YOU!
I dragged this image into CLIP Interrogator, which spat back the prompt: “a close-up of a cat with green eyes, blue text that says 3kliksphilip, epic urban bakground, poop, white border and background, licking out, epic poster, office cubicle background, golden toilet, funny cartoonish, erin, classic gem, messy eater, exploitable image, leave, motivational, moving poetry, toilet.” A nuanced syntax for image-generating prompts has emerged alongside the development of GenAI tools, and CLIP Interrogator’s “prompt” mimicked that accretionary layering of styles, details, and descriptors — though this list felt excessive, like a psychedelic extrapolation of the image, which I was glad to know was already a “classic gem.” Apparently 3kliksphilip is a gaming YouTuber, but who is Erin? Why is background spelled incorrectly the first time, but correctly the second? Where did the golden toilet come from? Leave?
After a few more back-and-forths I ended up with an image of a black-and-brown cat lounging on a commode that could have been designed by Frank Lloyd Wright. A bit of toilet paper, which had fallen onto the cat’s head from the roll above, approximated a hat. The image was flat and looked painted; dark contour lines conveyed the forms. The style felt familiar — Expressionist? German Expressionist? Faux-naïf? Influenced, certainly, by Modigliani, early Picasso, some of the later still lifes by the Polish Cubist Henri Hayden. Adjacent to the cat and the toilet, a copper cup balanced on a matching copper saucer. A series of blocky, all-caps words spilled across the floor: RULE OF TWO FREE PROSDY.
Precisely 197.8 seconds later, CLIP Interrogator described this tableau as “a painting of a cat sitting on a toilet, playstation 2 gameplay still, in style of pop-art, by Ignacy Witkiewicz, the fool tarot, inspired by Phil Foglio, punkdrone, molecular gastronomy, app, bong, persona 5, text: roborock, destroy lonely, dog, ascii, 1 8 2 4, tarot card design.” Destroy Lonely is not a command, I learned, but a trap artist from Atlanta. Roborock is a Roomba-like automated vacuum cleaner. Phil Foglio is a cartoonist best known for unconventionally silly Magic: The Gathering illustrations. The inclusion of the early-20th-century writer and painter Stanisław Ignacy Witkiewicz affirmed my intuition that there was something vaguely Polish about this image.1
Stable Diffusion makes images through detailed processes of category-based production, mapping language to a vast set of visual variables, while CLIP Interrogator performs the inverse function. The seemingly random strings of proper and phrasal nouns and adjectives are the result of neural networks “reading” the image and assessing sections of pixels for clues that are then correlated with terms, however opaquely. While the configuration of pixels that translates to “cat sitting on a toilet” is clear enough, those signaling “punkdrone” or “the fool tarot” are less so.
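For the technically curious, one round of this game can be reproduced outside the browser. The sketch below is a rough approximation of my setup, assuming the open-source diffusers and clip-interrogator Python packages; the model names and the GPU device are illustrative guesses on my part, not necessarily what powers the web apps I used.

```python
# A sketch of one round of algorithmic Eat Poop You Cat, assuming the Hugging Face
# diffusers library and the clip-interrogator package. The model identifiers and the
# CUDA device are illustrative assumptions, not the setup behind the browser apps.
import torch
from diffusers import StableDiffusionPipeline
from clip_interrogator import Config, Interrogator

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

prompt = "Eat Poop You Cat"
for round_number in range(2):
    image = pipe(prompt).images[0]   # text -> image, the Stable Diffusion half
    image.save(f"round_{round_number}.png")
    prompt = ci.interrogate(image)   # image -> text, the CLIP Interrogator half
    print(f"Round {round_number}: {prompt}")
```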
Because there are so many ways to picture even the simplest cat in the simplest scenario, text-to-image and image-to-text models are far from one-to-one processes of translation. If they were, the algorithms and I couldn’t play this game. But close reading even such an unserious set of prompts and images offers clues about the scaffolding behind these operations, as well as broader insights into the clumsy, grab-bag way humans tend to deploy language when attempting to describe an image. By starting with such a nonsensical string of four words, it’s possible I embarked upon a bit of a fool’s errand. But what does the fool seek? In tarot, the Fool is an innocent youth at the start of a journey, representing new beginnings and unlimited potential. He stands on the edge of a precipice, striking a pose not dissimilar to that of the figure in Caspar David Friedrich’s Wanderer Above the Sea of Fog. Though the path ahead is unclear and unstable, the Fool ventures forward.
I first encountered AI-generated images about a decade ago, toward the end of my time in art school. A peer double-majoring in neuroscience and furniture had trained something called an “artificial neural network” to “design” three-dimensional models of simple forms. He input a rendering of a Platonic ideal — a stool, in this case — which served as a template for the network to replicate. Through trial and error, the neural network, which he’d named DANA, or “Designer as Neural Activity,” attempted to craft a stable shape with four legs and a seat. By producing countless assemblages of thin beams and larger cylinders, DANA crept closer and closer to outputting the initial form. DANA’s creator postulated that neural networks could, in the future, be developed to be less reliant on a designer for this instigating visual prompt. “What if DANA, instead of existing in a vacuum, became an open loop of information? What if she took in pictures from Google Images, and learned based off of those, instead of my limited instructions?” He was ahead of his time.
I forgot about DANA’s anarchic stools until January 2021, when news of the image-generating platform DALL-E was suddenly everywhere. Descriptions of the “AI artist,” then in largely private beta testing, still felt like something out of a children’s book: Type in a sentence and the computer magically spits out an image! While the technology sounded too advanced to be real, it had been coming down the pipeline for decades. The first neural network — basically a set of interconnected processing nodes that translates an input signal to an output signal — was proposed in 1943 by the psychiatrist Warren McCulloch and the self-taught logician Walter Pitts as part of an attempt to describe how neurons themselves might work. The technology’s development continued in fits and starts throughout the 20th century, and its promise picked up as the nodes were arranged into increasingly complex layers. (“Deep” machine learning refers to any neural network with three or more layers of nodes.) Neural networks could decipher typed and handwritten characters as early as 1989, and computer-vision applications expanded rapidly in scope and public availability as hardware capacity continued to increase. Soon, optical character recognition, or OCR, converted PDFs to editable text, and now we can copy text snippets in photos taken on our phones. OCR relies on natural language processing, the field concerned with enabling algorithms to output and receive messages in human language rather than a programming language, which likewise advanced as computing power expanded. Natural language processing combines computational linguistics with statistical modeling and algorithms — now usually neural networks — to process and produce “natural” language through methods such as breaking down sentences, tagging parts of speech, assessing words’ most frequent positions in a sentence, and highlighting words that do the most prominent signifying (usually nouns and verbs). At first, these efforts were clumsy and stilted, but we’ve come a long way from AIM’s cherished dimwit, SmarterChild, to the often uncanny responses of Alexa and Siri.
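Those natural language processing steps are easier to picture with a toy example. The sketch below uses the open-source spaCy library and its small English model (my choice, purely for illustration) to break an invented caption into tokens, tag their parts of speech, and pluck out the heavy-signifying nouns and verbs.

```python
# A minimal sketch of basic natural language processing: splitting a sentence into
# tokens, tagging parts of speech, and pulling out the nouns and verbs that do the
# most signifying. Uses spaCy's small English model; the sentence is invented.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A black-and-brown cat lounges on a toilet designed by Frank Lloyd Wright.")

for token in doc:
    print(f"{token.text:<12} {token.pos_}")  # one part-of-speech tag per token

# The heaviest-signifying words are, as a rule, the nouns and verbs.
content_words = [t.text for t in doc if t.pos_ in ("NOUN", "PROPN", "VERB")]
print(content_words)
```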
By 2015, algorithmic processes were able to form simple sentences or phrases to describe an image, further integrating computer vision and natural language processing. Patterns of pixels identified as, say, “cat,” “toilet,” or “cup” were matched with linguistic tags, which were then translated into automated image captions in natural language. Quickly, researchers realized they could flip the order of these operations: What would it look like to input tags — or even natural language — and ask the neural networks to produce images in response? But reversing the image-to-text operation presented additional challenges. There is a vast difference between the complexity of a basic phrase and even the simplest image.
Describing any image with a large, centered feline as “a close-up of a cat” is unlikely to be wrong, whether it’s a macro photo of a sultry Siamese or a panel from Sunday’s Garfield. However, there are infinite possible ways to render the image described by “a close-up of a cat.” Is it a hyperdetailed photo in Portrait mode, setting off each strand of the cat’s fur? Or is every part of the image clumsily blurred, as all my iPhone snapshots are, taken through a lens smudged with finger grease? What about the angle, the lighting, the composition? Is the cat lying on a bed, or sitting up on a table? A simpler image of “a close-up of a cat” might be a line drawing or a flat, vector-style illustration, but either of those categories still encompasses thousands of possible variations. The authors of “Generative Adversarial Text to Image Synthesis,” a pivotal paper presented in 2016, at the 33rd International Conference on Machine Learning, described this problem of complexity with the straightforward tone of computer scientists: “There are very many plausible configurations of pixels that correctly illustrate the description.”
Another challenge — one also encountered by image-to-text models but exacerbated by the inversion — is the sheer quantity of visual data required to build up an understanding of the near-infinite visual signs that can be described in language. Some early attempts at image generation dealt with these paired issues of complexity and dataset size by constraining both the style of an image and its subject matter. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee — the authors of “Generative Adversarial Text to Image Synthesis” — began by training their models on contextually limited libraries of images, specifically the Oxford-102 Flowers and Caltech-UCSD Birds datasets, published in 2007 and 2011, respectively. Caltech-UCSD Birds contains 11,788 photographic images of birds broken down into 200 mostly North American species, annotated with additional attributes such as “Bill Shape,” “Belly Pattern,” and “Underparts Color.” The images making up the dataset were downloaded from Flickr and then categorized and annotated by human workers hired on Amazon’s Mechanical Turk, a crowdsourcing platform often referred to as “artificial artificial intelligence.” While one might assume today’s text-to-image tools have been automated all the way down, their architecture and maintenance rely on enormous quantities of human labor, whether the repetitive “clickwork” performed predominantly in the Global South by workers paid pennies per “task,” or the voluntary, quotidian labor you’ve provided each time you’ve filled out a CAPTCHA. To learn, neural networks need an initial set of labeled and categorized images, and a person needs to do that initial tagging and sorting — in this case, identifying the location of parts (“back,” “beak,” “belly,” “breast”) and attributes (“has_bill_length::about_the_same_as_head”) for the fifty-nine photos that typify the “Glaucous-Winged Gull.” The Oxford-102 Flowers were, somewhat less informatively, “acquired by searching the web and taking pictures.”
By training GANs, or generative adversarial networks — a then-new form of machine learning architecture that pairs neural networks in something of a dialectical back and forth — on these category-specific datasets of attribute-tagged images, Reed and his coauthors were able to generate unique, somewhat plausible bird images from natural language phrases. Photographic-looking images of birdlike shapes doing birdlike things in birdlike places emerged in response to “this small bird has a short, pointy orange beak and white belly” and “this magnificent fellow is almost all black with a red crest, and white cheek patch.” The authors then moved on to the MS-COCO dataset, which, unlike Caltech-UCSD Birds and Oxford-102 Flowers, is not limited to a single “object category.” The results of “a large blue octopus kite flies above the people having fun at the beach” were somewhat less convincing, though the GAN does appear to have loosely grasped the concept of a kite, if not the beach.
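That adversarial back and forth can be seen in miniature in the sketch below, a drastically simplified stand-in rather than Reed and his coauthors’ actual architecture: a generator learns to turn random noise plus a caption embedding into a (toy-size, flattened) image, while a discriminator learns to tell real image-caption pairs from generated ones. Every dimension and the random stand-in data are assumptions made for illustration.

```python
# A toy text-conditioned GAN in PyTorch: not the paper's model, just the adversarial
# pattern. All sizes, the random "photos," and the random "caption embeddings" are
# placeholders invented for illustration.
import torch
import torch.nn as nn

Z_DIM, TXT_DIM, IMG_DIM = 64, 128, 32 * 32 * 3  # noise, caption embedding, flattened image

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + TXT_DIM, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh(),
        )

    def forward(self, z, txt):
        return self.net(torch.cat([z, txt], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, img, txt):
        return self.net(torch.cat([img, txt], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):  # toy loop over random stand-ins for photos and captions
    real_imgs = torch.rand(16, IMG_DIM) * 2 - 1
    captions = torch.randn(16, TXT_DIM)

    # Discriminator: score real image-caption pairs high, generated pairs low.
    fake_imgs = G(torch.randn(16, Z_DIM), captions).detach()
    d_loss = bce(D(real_imgs, captions), torch.ones(16, 1)) + \
             bce(D(fake_imgs, captions), torch.zeros(16, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into scoring its output as real.
    fake_imgs = G(torch.randn(16, Z_DIM), captions)
    g_loss = bce(D(fake_imgs, captions), torch.ones(16, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```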
In February 2019, the US chip manufacturer Nvidia released an open-source version of StyleGAN, a generative AI that produces a near-infinite supply of unique, synthesized images of faces, allowing a user to control features such as face shape and hairstyles. StyleGAN was trained on the FFHQ dataset, or Flickr-Faces-HQ. Like Caltech-UCSD Birds before it, FFHQ is a collection of thousands of images pulled from Flickr, though its README does specify that “only images under permissive licenses were collected.” Workers hired through Mechanical Turk weeded out errant images of “statues, paintings, or photos of photos” from an original pool of algorithmically scraped images. While FFHQ includes “considerable variation in terms of age, ethnicity and image background” and “good coverage of accessories such as eyeglasses, sunglasses, hats, etc.,” its authors clarify that it also inherits “all the biases” present on Flickr.
Days after StyleGAN became open-source, Phillip Wang, a software engineer then working at Uber, created thispersondoesnotexist.com, a website that publishes a new, random, synthesized portrait upon each refresh. From there, a horde of copycats followed: This Cat Does Not Exist, This Horse Does Not Exist, This Foot Does Not Exist, This Pokémon Does Not Exist, This City Does Not Exist, This Chair Does Not Exist, and so on. While fears of deepfakes had been gracing headlines and raising hackles for more than a year, the sudden onslaught of images of People Who Did Not Exist seemed to trip a wire in the broader collective consciousness. Alarm struck the mediasphere as Twitter accounts DMing people for personal information were revealed to be scammers using StyleGAN-generated profile pictures. These deepfaces — a portmanteau that barely tracks — were quickly cited as threats to democracy, and calls arose for “anti-StyleGAN algorithms” that would catch and flag the generated images. Meanwhile, StyleGAN branched out and began to tackle anime portraits. While the image type changed, the subject matter remained constrained.
In contrast, ImageNet, a project initiated in 2006 by the computer scientist Fei-Fei Li, had the immodest aim of “map[ping] out the entire world of objects.” The dataset contains upward of fourteen million annotated images, organized into more than one hundred thousand “meaningful categories.” It also employed the labor of more than twenty-five thousand workers via Mechanical Turk. While one hundred thousand is an astonishing number of categories, it’s extraordinarily small when you consider that there are certainly more than Caltech-UCSD’s chosen two hundred “meaningful” species of birds, and more than five hundred kinds of items in “the entire world of objects,” each of which is likely to have more than two hundred meaningful categories. Categorical reduction and oversimplification never bode well, especially when it comes to labeling humans. ImageNet drew upon a preexisting lexical taxonomy called WordNet, which was developed in the 1980s and borrowed from several earlier lexical sets. As one dataset built upon another, each carried forward the logics and hierarchies of the previous set, if not all its terms. As researcher Kate Crawford and artist Trevor Paglen have written, the original ImageNet dataset contained an image of a child labeled as a “loser”; included the categories “slut,” “whore,” and “negroid”; and curiously placed “hermaphrodite” as a subcategory of “bisexual,” which in turn was listed as a subcategory of “sensualist,” alongside “cocksucker” and “epicure.” In 2019, ImageNet removed more than six hundred thousand images tagged with “unsafe,” “offensive,” or “sensitive” categories, patching the most visible cracks in a fundamentally flawed framework. Still, ImageNet’s categories look controlled and careful when compared with its successors.
On January 5, 2021, when the San Francisco–based research laboratory OpenAI announced DALL-E, it also announced CLIP (contrastive language-image pre-training), an image-classifying neural network, which was integrated into DALL-E’s processes. In a braggy blog post, OpenAI negs the ImageNet dataset for its costliness in terms of time and labor, as well as its limited range of content. “In contrast,” the post’s authors declare, “CLIP learns from text-image pairs that are already publicly available on the internet.” Where on the internet, exactly, we still don’t know. But considering the staggering scale of the training dataset — over four hundred million image-text pairs — the answer is likely pretty much everywhere: open-source images and captions from Wikimedia Commons and “images under permissive licenses” from Flickr, yes, but also almost definitely captioned images from Flickr not under permissive licenses, as well as images from across Facebook, Twitter, Instagram, BuzzFeed, global news outlets, underground news outlets, museum databases, Tumblr, DeviantArt, abandoned LiveJournal and Blogger blogs, and scientific-image databases. We know for certain, though, that CLIP’s training data also includes thousands of works by individual artists, illustrators, photographers, and graphic designers.
We know this because one of the things you could do with DALL-E — one of the things you were encouraged to do with DALL-E and its successors — is ask it to generate an image in the style of a particular artist. Unlike deepface, DALL-E is a canny portmanteau, mixing the names of Salvador Dalí — who played Exquisite Corpse with Valentine Hugo and André Breton — and our era’s most lovable robot, Pixar’s garbage-collecting WALL-E. DALL-E has concretized both the portmanteau gesture itself and the name’s subliminal meanings as the core of its identity and suggested function: DALL-E is a surrealist garbage compactor that performs mash-ups, seemingly designed for what OpenAI calls “combining unrelated concepts.”
In summer 2022, nearly a year after a public version called DALL-E Mini was released, social media was flooded with images that followed an “A but B” formula, juxtaposing a subject with an unexpected style or context: “Elon Musk painted by Pablo Picasso,” “Kim Kardashian painted by Salvador Dalí” (naturally), “Nosferatu in RuPaul’s Drag Race,” “R2-D2 getting baptized,” “SpongeBob SquarePants Godzilla,” “Cthulhu on Sesame Street,” and (a personal favorite) “a peanut butter sandwich Rubik’s Cube.” These generated images are not simply Frankenstein’s monsters assembled from various bits of images hoovered from across the web. Instead, GenAI models create generalized ideas of signs, signifiers, image types, and styles that correlate with probable pixel patterns. DALL-E’s deep learning algorithms decode a digital image’s arrangement of pixels into hundreds of axes of variables, which it then uses to assess an image and its component parts, and consequently create similar but unique arrangements in the future. When you ask a GenAI tool like DALL-E or Stable Diffusion to style an image after a particular artist, it isn’t copying the artist’s work so much as it is interpreting and reproducing the artist’s patterns — their subject matter, compositional decisions, and use of color, line, and form.
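Those shared “axes of variables” can be glimpsed with OpenAI’s publicly released CLIP weights, here accessed through the Hugging Face transformers library; the gray placeholder image and the candidate captions below are my own stand-ins.

```python
# A minimal sketch of CLIP's shared text-image embedding space, using OpenAI's
# released weights via the transformers library. The gray placeholder image and the
# candidate captions are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "gray")  # stand-in for any picture of a cat
captions = [
    "a close-up of a cat",
    "a painting of a cat sitting on a toilet",
    "a large blue octopus kite above a beach",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The image and each caption land in the same embedding space; the scaled cosine
# similarities say which caption the image most "correlates" with.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{p:.2%}  {caption}")
```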
The quantity and range of images available on the internet, and how they are tagged, impact how well GenAI tools can generate images of a certain subject matter: the more digital images of different works by a particular artist are available, the better the GenAI will be at replicating their style; the more a visual idea appears, the more it will be reproduced. Given that there is, for instance, an overrepresentation of images and descriptions of white men as surgeons on the internet, GenAI tools circa 2023 almost always produced a white man when you asked them to generate a surgeon. Rather than remediate the foundational issues in the datasets, these tools’ developers have attempted to obscure them through “debiasing,” or coding in safeguards to ensure diversity — which is how we get Gemini, Google’s recently rebranded GenAI tool, producing images of Nazis of color when prompted to “generate an image of a 1943 German soldier.”
As text-to-image GenAI tools grew increasingly sophisticated, the surrounding discourse — on news outlets, blogs, Reddit, Substacks, and LinkedIn — grew increasingly alarmed: “Generative AI Is Changing Everything”; “Did Picture-Generating AI Just Make Artists Obsolete?”; “Can AI End Your Design Career?”; “DALL·E 2 (AI Art) Is Getting Too Good That Its Depressing [sic]”; “Art Is Dead and We Have Killed It.” Many of these proclamations came from the camp of AI boosters, others from technophobes and visual artists themselves. In early May 2023, an open letter-cum-manifesto entitled “Restrict AI Illustration from Publishing” appeared on the website of the Center for Artistic Inquiry and Reporting, penned by the center’s director, Marisa Mazria Katz, and the prominent leftist illustrator Molly Crabapple. The letter outlines something of a fairy-tale relationship between journalism and illustration, which “speaks to something not just intimately connected to the news, but intrinsically human about story itself.” Generative tools, on the other hand, take mere seconds to “churn out polished, detailed simulacra of what previously would have been illustrations drawn by the human hand,” producing images that are either entirely free or cost “a few pennies.” The letter concludes with a call to “take a pledge for human values against the use of generative-AI images to replace human-made art.” More than four thousand people — a range of well-known writers, journalists, artists, and celebrities, including Naomi Klein and John Cusack — have signed.
There are plenty of reasons to be cautious about the use of GenAI for journalistic image production, the technology’s embedded biases and enormous energy footprint chief among them. As of late 2023, Stable Diffusion showed us that “Iraq” only ever looks like a military occupation and that “a person at social services” isn’t white, though “a productive person” usually is, and is always male, while “a person cleaning” is always a woman. Midjourney interpreted “an Indian person” with remarkable consistency as an old, bearded man in an orange pagri, and “a house in Nigeria” as a dilapidated structure with a tin or thatched roof. Meanwhile, a November 2023 study found that producing a single image with GenAI can use about the same amount of energy as charging a smartphone halfway — much more than is required to generate text — and that as models have grown more powerful and complex, they have also grown more energy intensive.
The threats to “human values” and the “humanity” of art, however, strike me as overblown. Humans produce GenAI tools — not only the scripts and mechanisms behind the technology, but the infrastructure at every stage: the Mechanical Turk workers tagging Caltech-UCSD Birds, weeding statues out of FFHQ, and labeling ImageNet; the anons shitposting on Twitter; the Kenyan content moderators paid $2 an hour to review endless horrors just so people can’t accidentally make DALL-E kiddie porn. Human choices, foibles, and prejudices are the very bedrock of these tools. I’m more frightened by GenAI’s humanity — all the assumptions and oddities inherited via their training images, every representational bias enshrined and automated in their tagging sets, each exhausted impulse of the underpaid laborers clicking and sorting as fast as they can — than by most other aspects of GenAI.
But what about artists’ livelihoods? It’s true that “no human illustrator can work quickly enough or cheaply enough to compete with these robot replacements,” as Mazria Katz and Crabapple write. But to say that “if this technology is left unchecked, it will radically reshape the field of journalism” is to paint a rather rosy picture of the field. The dystopian future Mazria Katz and Crabapple fear will come to pass if GenAI is left unchecked — the one in which “only a tiny elite of artists can remain in business, their work selling as a kind of luxury status symbol” — is, unfortunately, already here. Most publications see paying fair market wages for the often extensive labor required to produce a custom image as an unjustifiable expense. Why pay for images when there’s a plethora of stock photos and illustrations you can buy super cheaply, memes you can right-click and copy, open-source images you can download from Wikimedia, clip art you can drag and drop in, and preexisting work by illustrators that so many simply screenshot and steal? Of the publications and businesses that do still commission original work, many have long outsourced design and illustration through online gig-work platforms like Fiverr, which were modeled after the general concept of Mechanical Turk. The laborers most likely to be automated out of a job by GenAI, then, are these platform workers drawing and designing for low wages across the Global South.
Any boycott of GenAI imagery is bound to be as effective as a boycott of digital photography to prevent photo developers from losing their jobs, or putting your laptop on the curb in hopes of reinvigorating the market for typists. The best path forward for labor protections might be to ensure that those already trained in crafting communicative, compelling images — illustrators, artists, photographers, photo editors — become the best at using these systems.2 Like laptops, cameras, and paintbrushes, GenAI models are tools, and their true efficacy depends upon the skill and knowledge with which they are used. They are also, of course, tools crafted and actively maintained by humans, who deserve to be visible in the chain of image-production labor and considered in discussions of livelihoods.
Rather than “artificial intelligence,” then, I prefer to refer to these algorithmic, neural net–powered tools as estranged intelligence, or alienated intelligence. The intelligence — the humanity! — isn’t fake or forged; it is only concealed, outsourced and offshored, remixed and conglomerated, translated into algorithms which it then quietly labors to refine and train. But I know what Mazria Katz and Crabapple mean. It’s insulting to have your hard-won style stolen by an algorithm. I want to believe that something clear and visible is lost in AI-generated images, that what we call “the hand” — all the subtle, holy imperfections and artifacts of existence left on a made thing — is palpably missing. But I have taken all I can find of the online quizzes claiming to test one’s ability to distinguish between AI-generated images and photographs, paintings, and drawings made by other means, and I must be honest: I do poorly on these tests. Certainly they were built to stump, pitting the best outputs of the generators against uncanny works made by other means, but given that I’ve worked as a graphic designer, a design educator, and an editor at an art publication, I’d like to think I have a somewhat discerning eye. What, then, is the tell of absent humanity?
In the early days of DALL-E, Stable Diffusion, and Midjourney, the distinct tics of the generators’ weaknesses — mangled hands, habits of repetition, penchants for centered compositions, errors of physics — more readily betrayed their output as products of AI, while also making it fairly easy to tell images produced by each generator apart. Midjourney couldn’t quite achieve contemporary photorealism, but it excelled at anything painterly, often added a grungy or vintage flair, and had a penchant for gold, orange, and aqua tones. DALL-E preferred photorealism, blank or simple backgrounds, and cutesier cartoon styles. But with each generation of generators the tells have become less visible. What will happen, though, when the proportion of available visual material begins to tip toward the AI generated, and GenAI tools are increasingly trained on images of their own making? Once the snakes are all deepthroating their own orange and aqua tails, what patterns will crystallize in the images they shit out?
While text-to-image (and image-to-text) GenAI tools are built on the foundation of natural language processing, the language that tends to result in the best outcomes — and thus is returned by tools like CLIP Interrogator — reads as far from “natural.” The syntax of prompting is unique enough that a market for so-called prompt engineers has emerged, while blogs and vlogs covering Prompt Writing 101 abound. Some indicate that within prompts, comma-separated phrases order ideas or terms hierarchically. Whatever comes first in the list is deemed most important and will most significantly dictate the image; anything on the third line will likely have a minimal impact. Certain engines, like Midjourney, allow users to alter this schema by adding numbers to phrases to weight their relative consequence. An excessive prompt like “a close-up of a cat with green eyes, blue text that says 3kliksphilip, epic urban bakground, poop, white border and background, licking out, epic poster, office cubicle background, golden toilet, funny cartoonish, erin, classic gem, messy eater, exploitable image, leave, motivational, moving poetry, toilet” piles on related terms in order to establish the image’s content and approximate its style.
Most guides to prompt writing suggest a tripartite form: a subject, a description, and a style/aesthetic of the image. A “description” usually means a present-participle phrase, e.g., “a cat drinking coffee,” or “a bulldog swimming in the ocean.” The unruly prompt that CLIP Interrogator generated for my cat image has a clear subject — the cat with green eyes — accompanied by blue text, on an epic urban “bakground,” with a white border. When it comes to the “style/aesthetic” of the image, though, it’s less immediately clear what applies. “Epic poster” is a style, as is “funny cartoonish” and “exploitable image,” which refers to any kind of meme that someone can customize by adding their own text or supplementary image. But these aren’t the sorts of descriptors one would generally reach for when conjuring visual styles.
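Reduced to pseudocode, the tripartite recipe might look like the toy helper below. The function and its example phrases are my own invention, not any generator’s interface; the comment about Midjourney reflects only the numeric weighting convention mentioned earlier.

```python
# A toy illustration of the tripartite prompt recipe: subject first, then a
# present-participle description, then style terms, in descending order of influence.
# This helper and its phrases are invented; in Midjourney, relative weight can also
# be set explicitly with numeric suffixes such as "golden toilet::2".
def build_prompt(subject: str, description: str, styles: list[str]) -> str:
    # Earlier comma-separated phrases are treated as more important by the model.
    return ", ".join([subject, description, *styles])

print(build_prompt(
    "a close-up of a cat with green eyes",
    "sitting on a golden toilet",
    ["funny cartoonish", "epic poster", "white border"],
))
# a close-up of a cat with green eyes, sitting on a golden toilet, funny cartoonish, epic poster, white border
```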
Terms that have become popular prompting shorthand for achieving a distinct look include “retro,” “product photography,” “food photography,” “highly detailed,” “digital art masterpiece,” “C4d render,” “Octane Render,” and “trending on ArtStation.” Names of proprietary software and platforms — such as the 3D-modeling software Cinema4D, or C4D for short; Octane, an “unbiased” graphics-rendering software; and ArtStation, a platform showcasing portfolios of work by game designers and animators — have transformed into adjectives or adjectival phrases overnight. Likewise, artists’ names are more often deployed to achieve a visual style than to directly ape an artist’s work. We already have the cultural habit of using proper nouns as eponyms for periods and styles (Louis XIV, Bauhaus, Studio 54), but prompt language has accelerated the trend. There are now websites, like Midlibrary.io, that catalog thousands of image styles indexed by artist names, largely those of digital artists and concept designers.
Prompt crafting relies on learning these terms and understanding the mass of visual phenomena to which they are yoked — the subject matter, visual attributes, media, and composition styles. To prompt an image in the style of Witkiewicz’s late pastel portraits would mean to understand that a generator’s output would likely be of a woman from the waist up, somewhat centered in the frame, with thin, high-arched eyebrows; pursed, glossy lips; oversize eyes rendered with more polish than the rest of the image; and at least a whiff of the sitter’s contempt. Depthless backgrounds might encroach on the figure with ragged, nearly electric lines; clothes might be rendered with similar eccentricity. Other than its somewhat large eyes, my toilet-perched cat looked nothing like this, nor did it significantly resemble Witkiewicz’s earlier Expressionism. After some additional searching, I realized that the closest reference for “by Ignacy Witkiewicz” may be a painting of Witkiewicz. Rafał Malczewski’s portrait of the painter includes, among other curious features, what I can only assume is a particularly bizarre cat balanced across Witkiewicz’s shoulders, a perch from which it leers with buggy eyes and a nonexistent jaw. “Witkiewicz” alone might not get me what I’m looking for if I’m seeking a vibrational, contemptuous woman, but “pastel portrait by Stanisław Ignacy Witkiewicz” could get me closer. (Art history may yet matter.)
While prompt writing is quickly becoming a marketable skill, there’s still much about the innermost workings of the deep learning algorithms that even the most advanced engineers don’t fully understand. Sam Bowman, who runs an AI research lab at NYU, told Vox that even specialists like him can’t discern what concepts or “rules of reasoning” are being used by most of these complex systems. “We built it, we trained it, but we don’t know what it’s doing,” Bowman confessed. This sense of mystery tends to be echoed by GenAI tools whenever they’re asked to describe themselves. After a Redditor prompted GPT-4 to visualize itself, it used DALL-E 2 to generate images of glowing orbs reaching toward the shelves of a library with illuminated tendrils. The tendrils’ tips connect not to books but to scattered floating shapes, glowing in midair. I appreciate these images for the candor of their fantasy: GenAI is always surrounded by knowledge, but does not access it straightforwardly.
Circa October 2022, back in the days of clear GenAI tells, OpenAI’s DALL-E 2 had a hard time with context clues and sequencing, particularly when dealing with how adjectives or descriptive phrases are applied to nouns or verbs. If you told DALL-E 2 to generate “a fish and a gold ingot,” it usually gave you a fish that was also gold, frequently a goldfish, as if attempting a kind of wordplay.
DALL-E 2 also went nuts for heteronyms. One heteronym-specific example, as elucidated by Royi Rassin, Shauli Ravfogel, and Yoav Goldberg in “DALLE-2 Is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models,” involves the prompt “a bat is flying over a baseball stadium,” which produced a jaunty, cartoonish, vector-like illustration of a baseball stadium, over which a baseball, a baseball bat, and the animal we know as a bat all fly, racing out of the image’s top right corner. The problem is that the tag “bat” correlates to two different kinds of pixel patterns, and the GenAI isn’t sure which to choose. Hedging its bets, it throws in both.
Rassin et al. describe the confusion that lurks in these linguistic-to-visual translations as the “semantic leakage of properties between entities.” In the image, the two kinds of bats appear to be soaring in tandem; perhaps the bat (animal) is actually wielding the bat (baseball). A white teardrop shape seems an attempt at a smile, indicating our friend the bat (animal) is having a great time. He appears either to be playing the game or to be gleefully absconding with the bat (baseball) as an act of other-team sabotage. To the bats’ left, a flat gray cloud and a lightning bolt interrupt the blue sky. The paper’s authors don’t provide a clear linguistic reason for how the lightning bolt snuck in there, but my untested image-associative guess is that bats (animal) frequently show up in imagery with witches, who are prone to doing spells and zapping things.
The lightning bolt is a good example of what Rassin et al. refer to as “second-order stimuli”: the networked associations embedded in language and images we’re likely not conscious of unless we’re paying very specific attention. When you ask DALL-E 2 for an armadillo on a sea shore, it will often throw in a few shells as well. Why? Well, think of armadillo’s word-cloud terms, or what Fei-Fei Li calls its “social network of visual concepts”: mammal, armor, ball, and . . . shell. (For comparison, a request for “dog on a sea shore” generates a beach, but no shells.) This “leakage” of associative traits is subtler than DALL-E 2’s direct confusion of heteronyms and is therefore somewhat harder to spot. However, it can add a deeper layer of absurdity to these images, which is often pointed to as proof of the generative tools’ lack of sophistication, their poor results. It would be a mistake, though, to treat semantic leakage as proof of the technology’s clumsiness rather than its acute sensitivity. “A tall, long-legged, long-necked bird and a construction site” spits out an image that includes both a crane (bird) and a crane (construction equipment). While this would initially read as an error, and software engineers are surely working to resolve the bug, it’s in fact a sophisticated linguistic affiliation, a return of the heteronym problem by proxy, as the word crane never appears in the prompt.
For all the biases and patterns that they reify, GenAI tools also inherit and pictorialize language’s nuances and ambiguities — such as English’s excess of heteronyms and homonyms, and their possible confusion. When a publisher turns to text-to-image generative AI to illustrate a new biology textbook, for example, what carceral elements will be interwoven in diagrams of “the cell,” and, from there, into a generation of students’ concepts of cytology? New image-making technologies — whether the printing press, the camera, or satellite imaging — change our perception of the world, which in turn changes our behaviors. The question at hand is: What are these algorithmic images teaching us to see, say, and do?
As of January 2024, GenAI text-to-image tools produced about thirty-four million images per day. This number is still dwarfed by the daily count of digital photographs, but for how long? From here on out, it’s safest to assume that any image you encounter might be generated. What differentiates these images is not their lack of humanity but their intense abundance of it: all the alienated intelligence, historical strata, and linguistic tics embedded and reproduced within them. Each prompter sets off a huge chain of networked collaboration with artists and academics, clickworkers and random internet users, across time and space, engaging in one massive, multicentury, ongoing game of Eat Poop You Cat. Like it or not, we all — whether pre-algorithmic image makers or self-described AI artists — will have to learn to play.
Witkiewicz is a fitting reference for a tool that creates images by interpreting visual categories. After painting for decades, he abandoned his less-than-successful Expressionist style to establish the “S. I. Witkiewicz Portrait-Painting Firm,” for which he became well known. He defined seven “types” of portraits, such as “Type A—Comparatively speaking the most ‘spruced up’ type” and “Type B+ supplement—Intensification of character, bordering on caricature. The head larger than natural size.” Each painting’s type determined its price, which generally decreased as Witkiewicz gained leeway to distort the figure. (This rule was notably broken by the priceless “Type C, C + Co, Et, C + H, C + Co + Et, etc,” which were “executed with the aid of C2H5OH,” colloquially known as alcohol, and “narcotics of a superior grade.”) ↩
Wired, the first US publication to adopt an official AI policy, has already enshrined this idea in guidelines. “Some working artists are now incorporating generative AI into their creative process in much the same way that they use other digital tools,” the policy notes. Wired “will commission work from these artists as long as it involves significant creative input by the artist and does not blatantly imitate existing work or infringe copyright. In such cases we will disclose the fact that generative AI was used.” The magazine expressly says it will not use GenAI images instead of stock photography, as “selling images to stock archives is how many working photographers make ends meet.” ↩