
— The arc in numbers —
1973
AARON — Harold Cohen’s rule-based art system begins
2014
Goodfellow’s GAN paper changes the trajectory
2022
DALL-E 2, Midjourney, Stable Diffusion all ship
2025
Gemini 2.5 Flash Image (“Nano Banana”) ships
§ 01 · The Pre-History
Computers made pictures for forty years before anyone called it art.
The first computer-generated images that anyone took seriously came out of Bell Labs and military research programs in the early 1960s. Michael Noll, working at Bell Labs in 1962, produced patterned images on a microfilm plotter that he later described, with characteristic understatement, as the first attempts at computer art. Around the same time, Frieder Nake in Germany and Vera Molnár in France were using mainframe plotters to generate algorithmic compositions that wound up in gallery shows. None of these systems would be recognised today as anything resembling generative AI. They were deterministic. They followed rules. The artist wrote a programme, the programme drew an image, and the result reflected the human’s choices more than the machine’s.
The first system that started to look like AI image generation in the modern sense was Harold Cohen’s AARON, which Cohen began building at the University of California, San Diego in the early 1970s. AARON was a rule-based program that generated original drawings and, eventually, paintings — with Cohen continuing to refine its rule set for more than four decades. AARON did not learn from data. It encoded Cohen’s artistic principles into branching logic, and the output was the result of the program executing those rules with controlled randomness. The work hung in major museums. Whether the artist was Cohen, who wrote the rules, or AARON, which executed them, became, more than fifty years before generative AI tools went mainstream, a live debate.
The 1980s and 1990s saw a parallel track in fractal art, evolutionary art, and procedural generation. Karl Sims’s evolutionary image systems — in which images were “bred” through genetic algorithms with human selection — produced striking results and pointed at something important: that the most useful role for a human in the loop might not be specifying the output, but selecting among machine-generated options. Sims’s work, alongside the broader genetic programming community, anticipated the prompt-and-curate workflow that would define generative AI thirty years later. None of this, however, was generative AI in any modern sense. The systems did not understand what they were drawing. They generated visual structure and let humans decide which structures were interesting.
— The Pre-Deep-Learning Era —
| Year | System | Approach | Significance |
|---|---|---|---|
| 1962 | Bell Labs plotter art | Algorithmic patterns on microfilm | Earliest serious computer-generated imagery |
| 1973 | AARON | Rule-based expert system | First long-running autonomous “art-making” programme |
| 1985 | Mandelbrot set tools | Iterative fractal computation | Made aesthetic complexity from minimal code |
| 1991 | Karl Sims’s evolutionary art | Genetic algorithms + human selection | Foreshadowed the prompt-and-curate workflow |
| 2002 | Perlin noise tooling | Procedural texture generation | Backbone of game and effects visuals for two decades |
| 2009 | Deep belief network image work | Restricted Boltzmann machines | First serious neural-network image synthesis attempts |
Generative imagery existed for half a century before deep learning. The pre-2014 systems were largely deterministic, rule-based, or used neural networks too small to generalise.
§ 02 · The GAN Era (2014–2018)
Ian Goodfellow’s napkin sketch that quietly broke open the field.
The conventional retelling of generative AI history starts in 2014, in a Montreal bar called Les 3 Brasseurs, where Ian Goodfellow — then a PhD student at the Université de Montréal — was arguing with friends about how a neural network might generate convincing images. The conventional wisdom at the time was that you needed an enormous amount of statistical work to model the distribution of real images directly. Goodfellow’s insight, allegedly worked out over beers and then implemented overnight at home, was that you did not need to model the distribution at all. You needed two networks: one that tried to generate realistic images, and another that tried to tell real images from fakes. The two networks would train against each other, each forcing the other to get better. He called the architecture a Generative Adversarial Network. The paper went online a few months later. The field changed.
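The mechanics are simple enough to sketch. A minimal adversarial training loop, written here in PyTorch with placeholder layer sizes, data shapes, and hyperparameters rather than anything from the original paper, looks roughly like this:

```python
# Minimal GAN training sketch (illustrative only; sizes and settings are assumptions).
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28

generator = nn.Sequential(                      # maps random noise -> fake image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(                  # maps image -> probability it is real
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                    # real_images: (batch, 784), scaled to [-1, 1]
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1. Discriminator step: learn to separate real images from generator output.
    fake_images = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2. Generator step: learn to make the discriminator call its fakes "real".
    g_loss = bce(discriminator(generator(torch.randn(batch, latent_dim))), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

The same loop also hints at why GANs were notoriously unstable: neither network has a fixed target, so the whole system can oscillate or collapse if one side learns faster than the other.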
The first GAN images were, by modern standards, unimpressive. Tiny, blurry, with a distinctive plastic quality and frequent failures of basic geometry. But they were unmistakably faces, animals, and rooms generated by a machine that had never been told what those things looked like — only shown enough examples to learn the pattern. The community moved fast. By 2017, NVIDIA’s Progressive Growing GAN was producing photorealistic faces at megapixel resolution. In 2018, StyleGAN dropped the rest of the field a generation behind: it could not only generate faces but disentangle style and content, letting a user mix the structure of one face with the colouring and texture of another. The website ThisPersonDoesNotExist.com, launched in early 2019 by an engineer working alone with StyleGAN, gave the public its first viral encounter with a generative-AI artefact. Most users could not tell the synthetic faces from real photographs.
The GAN era ran from 2014 through roughly 2021 and produced an extraordinary amount of research. BigGAN scaled GANs to ImageNet-quality outputs at 512×512. CycleGAN learned image translation between unpaired domains. GauGAN generated photorealistic landscapes from rough segmentation maps. The technical achievements were real. But GANs had a structural problem the community could not quite solve: they were brittle to train, prone to mode collapse, and stubbornly resistant to text conditioning. You could generate a convincing face, or a convincing horse, but reliably generating “a horse with a moustache wearing a top hat” was nearly impossible. The next breakthrough would come from somewhere else.
§ 03 · The Language Bridge (2019–2021)
CLIP turned out to be the missing piece nobody had named.
The real conceptual breakthrough between GANs and the modern era arrived almost as an afterthought. In January 2021, OpenAI released two papers on the same day: DALL-E, a model that generated images from text descriptions, and CLIP, a model that mapped images and their text captions into the same mathematical space. DALL-E got the press. CLIP turned out to be the foundation of everything that came after. By training on roughly four hundred million image-text pairs scraped from the internet, CLIP learned a joint representation in which the embedding for the phrase “a yellow dog on a beach” landed very close to the embedding for an actual photograph of a yellow dog on a beach. This sounds modest. It was not. It meant that any generative model could be guided by CLIP’s text understanding without having to learn that understanding itself.
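The practical upshot is easy to demonstrate. A short sketch using the open CLIP weights through the Hugging Face transformers wrappers (the model id is a public checkpoint; the example image path and captions are placeholders) scores an image against candidate captions in that shared space:

```python
# Sketch: score an image against candidate captions with CLIP (transformers assumed installed).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach_dog.jpg")  # hypothetical local photo
captions = [
    "a yellow dog on a beach",
    "a bowl of ramen on a wooden table",
    "a city skyline at night",
]

# Both modalities are projected into the same embedding space; similarity is just a
# dot product between the normalised image and text vectors.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```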
DALL-E 1, the first text-to-image model from OpenAI, used a discrete variational autoencoder (a dVAE, closely related to VQ-VAE) rather than GANs — the model treated image generation as a sequence-prediction problem, generating tokens that decoded back to image patches. Outputs were small, often surreal, and inconsistent, but the system could draw what the prompt described, which had never quite worked before. The internet noticed. Around the same time, independent researchers and hobbyists were stitching CLIP onto existing generative models in clever ways: CLIP-guided GANs, CLIP-guided VQ-VAEs, and eventually CLIP-guided diffusion. The CLIP-guided diffusion experiments through 2021, particularly the ones by researchers like Katherine Crowson and the broader open-source community, produced the first generation of text-to-image art that ordinary people on Twitter could replicate at home with a Google Colab notebook. The technology stopped being something only OpenAI could ship.
For seven years the community had been trying to make GANs understand language. The answer turned out to be to stop making the generative model understand language at all, and just hand it a translation already done by something else.
§ 04 · The Diffusion Breakthrough (2022)
Three models, one year, an entire industry rewritten.
The technical foundation for what happened in 2022 was a class of generative model that had been quietly maturing in academic papers since 2015: diffusion models. The intuition behind diffusion is unexpectedly elegant. You take a real image and progressively add noise to it until it becomes pure static. Then you train a neural network to reverse the process — to denoise back toward the original. Once trained, you can give the network a sample of pure noise and it will, step by step, denoise it into a coherent image that matches whatever conditioning you provide. The mathematics were established in earlier papers, but the systems were slow and the outputs underwhelming. What changed between 2020 and 2022 was a combination of better training techniques, larger models, and the CLIP-based conditioning that let diffusion models accept text prompts as guidance.
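The two halves of that idea fit in a few lines. The toy sketch below, in PyTorch with a simplified linear noise schedule and a stand-in denoiser network, shows the forward noising used during training and the step-by-step reverse sampling used at generation time; it is illustrative, not a production DDPM:

```python
# Toy diffusion sketch (schedule, network interface, and step count are simplified assumptions).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: blend a clean image x0 with Gaussian noise at timestep t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt()
    b = (1.0 - alphas_cumprod[t]).sqrt()
    return a * x0 + b * noise, noise

def training_loss(denoiser, x0, text_embedding):
    """Train the network to predict the noise that was added, conditioned on text."""
    t = torch.randint(0, T, (1,))
    noisy, noise = add_noise(x0, t)
    predicted = denoiser(noisy, t, text_embedding)
    return torch.nn.functional.mse_loss(predicted, noise)

@torch.no_grad()
def sample(denoiser, shape, text_embedding):
    """Reverse process: start from pure noise and denoise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        predicted_noise = denoiser(x, torch.tensor([t]), text_embedding)
        alpha, acp = 1.0 - betas[t], alphas_cumprod[t]
        x = (x - betas[t] / (1.0 - acp).sqrt() * predicted_noise) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject noise except at the final step
    return x
```

The `text_embedding` argument is where CLIP-style conditioning enters: the denoiser is simply handed a text representation it never had to learn itself.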
DALL-E 2 launched in April 2022 and immediately broke the previous benchmark for text-to-image quality. The model could produce photorealistic outputs, illustrations in identifiable artistic styles, and surreal compositions with a coherence that GAN-era systems had never managed. OpenAI gated access initially, releasing it in waves through a waitlist that became its own minor cultural event. Then, in July 2022, an independent research lab called Midjourney launched a Discord-based image generation service that anyone could use immediately. Midjourney’s outputs had a distinctive painterly quality and reached a quality threshold that, more than any other system to that point, made AI art something the wider public actually wanted to share. The Discord community grew to millions of users within months.
The genuinely transformative event came in August 2022. Stability AI released Stable Diffusion with fully open weights. Anyone with a consumer GPU could now run a state-of-the-art text-to-image model locally, fine-tune it on their own data, modify the architecture, and integrate it into any application without an API. The combination of CompVis’s technical work, the LAION nonprofit’s open dataset, and Stability AI’s funding produced an open foundation model that the proprietary players could not match on accessibility. Within weeks, fine-tunes, GUIs, browser plugins, and creative tools were emerging from a global hobbyist community that had not existed three months earlier. The text-to-image space went from three serious systems to thousands of derivative projects in a single quarter. There has been nothing quite like it in the history of AI deployment.
— The Deep Learning Generation —
| Year | System | Architecture | What It Changed |
|---|---|---|---|
| 2014 | Original GAN | Generator + discriminator | First viable neural image synthesis |
| 2017 | Progressive GAN | Multi-scale GAN training | Megapixel photorealistic faces |
| 2018 | StyleGAN | Style-based generator | Disentangled control over generation |
| 2021 | DALL-E 1 | dVAE + transformer | First serious text-conditioned generation |
| 2021 | CLIP | Contrastive language-image embedding | Made everything else conditionable on text |
| 2022 | DALL-E 2 | CLIP-guided diffusion | Photorealistic text-to-image quality |
| 2022 | Midjourney v1–v3 | Proprietary diffusion variant | Made AI art a consumer phenomenon |
| 2022 | Stable Diffusion 1.x | Latent diffusion, open weights | Democratised the technology globally |
| 2023 | SDXL, DALL-E 3, MJ v6 | Refined diffusion at scale | Production-grade output reliability |
| 2024 | Flux series | Rectified flow architecture | Open-weight quality matching closed models |
| 2024 | GPT-4o (image) | Multimodal autoregressive | Image generation in the same model as chat |
| 2025 | Gemini 2.5 Flash Image | Native multimodal generation (“Nano Banana”) | Character consistency & rapid editing at API scale |
A handful of architectures, each unlocking a capability the previous generation could not match. The compounding effect is what produced the present-day capability.
§ 05 · The Open-Source Explosion (2022–2023)
A foundation model in the wild produces things its creators never imagined.
The post-Stable-Diffusion phase reshaped expectations about how AI capability propagates. Within weeks of the release, communities on Discord, Reddit, and GitHub were producing fine-tuned variants for specific aesthetics — anime, photography, fantasy art, architectural visualisation, product photography. The LoRA technique — Low-Rank Adaptation, which allowed efficient fine-tuning of large models with small training sets — meant that anyone with a few hundred sample images and a consumer GPU could produce a specialised version of Stable Diffusion in an afternoon. By mid-2023, the open-source ecosystem around image generation was producing more capability, faster, than any single proprietary lab could match on equivalent infrastructure spend.
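The trick behind LoRA is compact enough to show directly. The sketch below is a simplified PyTorch illustration rather than any particular library’s implementation: it freezes a pretrained linear layer and learns only a small low-rank correction on top of it.

```python
# Minimal LoRA sketch (rank, scaling, and layer sizes are illustrative assumptions).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad_(False)                         # frozen pretrained weights
        out_f, in_f = base_layer.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable, small
        self.B = nn.Parameter(torch.zeros(out_f, rank))         # starts at zero: no change at first
        self.scale = alpha / rank

    def forward(self, x):
        # Original projection plus the low-rank correction (B @ A) applied to x.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Wrapping one (hypothetical) cross-attention projection of a diffusion model:
proj = nn.Linear(768, 768)            # stands in for a pretrained weight matrix
lora_proj = LoRALinear(proj, rank=8)
trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # a tiny fraction of the frozen 768x768 base weight
```

Only the two small matrices are trained and shipped, which is why a finished LoRA is typically a few megabytes rather than the multi-gigabyte base model.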
The unintended consequence was that the legal, ethical, and labour questions that the field had been quietly avoiding came due all at once. Visual artists discovered their styles being explicitly reproduced by tools trained on their work without consent. Stock photo libraries discovered their archives in training datasets. LAION-5B, the dataset that underlay Stable Diffusion and most of the early open diffusion models, was found to contain problematic content that the original scrapers had not properly filtered. Class-action lawsuits arrived. Getty Images sued Stability AI. Artists organised through groups like Concept Art Association and the European Guild for AI Regulation. The pure technical optimism of mid-2022 gave way, within eighteen months, to a much more complicated conversation about consent, compensation, and the rights of training-data subjects. None of these questions are settled. Most of them are now being negotiated through court cases, the EU AI Act’s training-data transparency provisions, and individual licensing agreements that did not exist in 2022.
§ 06 · The Refinement Era (2023–2024)
The boring years, in which most of the actual improvement happened.
The two years between the diffusion breakthrough and the multimodal era look quiet in the popular narrative. Inside the industry they were anything but. SDXL, the next major Stable Diffusion release, raised baseline quality substantially. DALL-E 3 in late 2023 brought a step-change in prompt adherence — the model actually followed long, detailed prompts in a way that previous systems had struggled with. Midjourney v6 produced photorealism that consistently fooled professionals in side-by-side tests with real photographs. ControlNet, ComfyUI, and a whole tooling layer around the open-source models added compositional control: you could now specify pose, depth, edge structure, segmentation map, and reference imagery as additional inputs alongside the text prompt. The Flux series, released by Black Forest Labs in 2024, gave the open-weight community a foundation model that matched or exceeded the proprietary options on most benchmarks.
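In practice that compositional control is a few lines of code. A hedged sketch using the open-source diffusers library (the model ids are public checkpoints; the edge map is a hypothetical pre-computed input, normally produced with an edge detector on a reference photo) generates an image whose layout follows the edges while the prompt controls content and style:

```python
# ControlNet sketch with diffusers (requires a GPU; checkpoints and paths are assumptions).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("building_edges.png")   # hypothetical pre-computed Canny edge image

# The prompt sets content and style; the edge map pins down the composition.
result = pipe(
    "a red brick townhouse at golden hour, photorealistic",
    image=edge_map,
    num_inference_steps=30,
).images[0]
result.save("townhouse.png")
```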
The genuinely consequential shift in the refinement era was not in image quality per se. It was in workflow integration. Adobe Firefly arrived inside Photoshop. Canva built generative tools into its core editor. Microsoft Designer, Google’s ImageFX, and a long tail of vertical-specific tools embedded text-to-image generation inside the creative workflows that working designers actually used. The technology stopped being something you opened a separate tab to access and started being something you summoned with a keyboard shortcut inside the application you were already using. That shift mattered more for adoption than any quality improvement during the same period. The user base of generative image tools roughly quadrupled between 2023 and 2025, and the largest single driver was integration into existing software rather than novelty experiences.
§ 07 · The Multimodal Frontier (2024–2025)
A model nicknamed Nano Banana, and the consistency problem solved at last.
The most consequential technical shift in image generation since the original diffusion breakthrough arrived in 2024 and 2025 with a generation of multimodal models that treated text and image generation as the same problem rather than two adjacent ones. GPT-4o, OpenAI’s flagship 2024 model, could generate images natively as part of a chat — not by handing the prompt off to a separate image model, but by directly producing image tokens alongside text tokens in a single forward pass. The integration meant that the model understood the conversation it was generating an image inside, could maintain visual context across multiple generations in a chat thread, and could follow editing instructions in natural language with substantially better fidelity than the older text-to-image-only systems.
Google’s response, released in August 2025, was a model the team had nicknamed Nano Banana during development — ostensibly because the lead researcher had a banana on his desk during a critical review meeting, though the actual etymology has been disputed within Google. The model shipped publicly as Gemini 2.5 Flash Image and lived up to its quietly accumulated hype. It produced image generation and editing capabilities at conversational speed, but the genuinely new capability was character consistency: you could feed the model a reference image of a person, place, or object and have it appear coherently across dozens of subsequent generations with different scenes, poses, and styles. The failure mode that had haunted text-to-image systems since DALL-E 1 — the inability to keep a character’s face the same across a sequence of pictures — was, for the first time, largely solved at API scale. Within weeks of release, design studios, advertising agencies, and small creator businesses had built workflows around it that previous tools could not support.
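A typical character-consistency call against the public API is short. The sketch below uses the google-genai Python SDK; the model id, file names, and response-handling details are assumptions drawn from public documentation rather than a verified recipe:

```python
# Sketch: reuse a reference character across a new scene with the Gemini API
# (google-genai SDK assumed installed; model id and parsing details are assumptions).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Load a reference image of the character to keep consistent.
with open("character_reference.png", "rb") as f:
    reference = types.Part.from_bytes(data=f.read(), mime_type="image/png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed public id for the "Nano Banana" model
    contents=[
        reference,
        "Show this same character riding a bicycle through a rainy city at night.",
    ],
)

# Image outputs come back as inline data parts alongside any text.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("output.png", "wb") as out:
            out.write(part.inline_data.data)
```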
Nano Banana’s release also crystallised a competitive dynamic that had been building throughout the multimodal transition. The frontier image generation capability was now embedded inside general-purpose foundation models from a small number of hyperscaler-backed labs — OpenAI on Microsoft Azure, Google’s Gemini family on GCP, Anthropic’s Claude (text-focused but increasingly multimodal), and a long tail of specialised image-generation services running on the same underlying cloud infrastructure. The open-source ecosystem remained vital and continued to improve, particularly through the Flux family and Black Forest Labs’ ongoing releases, but the gap between the best open-weight models and the best proprietary multimodal systems widened during 2025 in ways the community had not seen since the original Stable Diffusion release closed it.
§ 08 · The Economics in 2026
Image generation got cheap. Image-generation businesses, less so.
A single image generation from a leading model in 2026 costs somewhere between a tenth of a cent and a few cents at API rack rates, depending on the model, the resolution, and the conditioning inputs. That is roughly two orders of magnitude cheaper than what equivalent quality cost on per-image API pricing in 2022, and several orders of magnitude cheaper than what it would have cost in compute-hour terms to train a comparable system in-house. The per-image numbers are small enough that most casual users never have to think about them. The per-image numbers are also large enough that any business generating images at production volume — ad agencies running thousands of asset variations, e-commerce platforms producing product images, design tools embedding generative features into their core workflow, content marketing operations producing imagery at scale — ends up with a meaningful monthly inference bill.
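The arithmetic is straightforward to run for any given workload. The figures below are placeholder assumptions within the range quoted above, not published rack rates:

```python
# Back-of-the-envelope inference bill for a production image workload (all numbers assumed).
price_per_image_low = 0.001    # $0.001 per image, i.e. a tenth of a cent
price_per_image_high = 0.04    # $0.04 per image, i.e. a few cents

images_per_day = 25_000        # e.g. an e-commerce platform rendering product variants
days_per_month = 30

monthly_low = images_per_day * days_per_month * price_per_image_low
monthly_high = images_per_day * days_per_month * price_per_image_high

print(f"Monthly bill at {images_per_day:,} images/day: "
      f"${monthly_low:,.0f} to ${monthly_high:,.0f}")
# -> Monthly bill at 25,000 images/day: $750 to $30,000
```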
The current cost dynamics have produced a quiet shift in how serious users of image generation APIs procure their capacity. The hyperscalers offer significant discounts on prepaid commitment tiers — commit to a year of GCP Vertex AI spend, AWS Bedrock spend, or Azure OpenAI spend, and the effective per-image cost drops meaningfully. The trade-off is that the commitment is locked in regardless of actual consumption, and businesses routinely end the year sitting on unused balances they cannot recover under standard agreements. Secondary marketplaces have emerged to clear this inefficiency, matching buyers and sellers of unused cloud credits directly. AI Credit Mart is one of the better-known operators in this space, providing a venue where studios, agencies, and AI-first product teams can buy discounted Nano Banana API capacity through GCP credits left over from other buyers’ commitments — an arrangement that has become a routine part of procurement for any operation running image generation at meaningful scale.
The compounding effect of cheaper per-image cost and growing per-business volume means that the total spend on image generation infrastructure has continued to rise sharply across the industry, even as headline per-image pricing has fallen. The history of AI image generation is, increasingly, also the history of the infrastructure economics that determine which businesses can afford to deploy it at scale and which cannot.
— Reader Questions —
Fifteen questions on the history and present of AI image generation.
When did AI image generation actually start?
It depends on the definition. Computer-generated imagery in any form began at Bell Labs and adjacent research labs in the early 1960s. Rule-based art systems began in the 1970s with Harold Cohen’s AARON. Neural-network image generation in the modern sense began in 2014 with the GAN paper. Text-to-image as the public would recognise it began in earnest with DALL-E 1 in early 2021 and the diffusion breakthrough in 2022.
What was the GAN paper and why did it matter?
Ian Goodfellow’s 2014 paper introducing Generative Adversarial Networks — two neural networks trained against each other, one generating images and one trying to spot the fakes. It was the first widely successful approach to producing convincing synthetic images, and it dominated the field for roughly the next seven years before diffusion took over.
What is CLIP and why is it considered foundational?
CLIP, released by OpenAI in January 2021, is a model that maps images and their text captions into the same mathematical space. It made text-to-image generation practical because subsequent generative models could be guided by CLIP’s text understanding without having to learn that understanding themselves. Nearly every text-to-image system since 2021 uses CLIP or a CLIP-like text encoder somewhere in the pipeline.
What is a diffusion model?
A class of generative model trained by progressively adding noise to real data and then learning to reverse the noising process. At inference time, you give the model pure noise and let it denoise step by step into a coherent image, guided by a text prompt or other conditioning. Diffusion models replaced GANs as the dominant text-to-image architecture in 2022 and remain the foundation of most current systems.
Why was 2022 such a big year?
Three independent systems crossed the public-usability threshold within a few months: DALL-E 2 in April, Midjourney in July, and open-source Stable Diffusion in August. The combination meant that text-to-image generation went from a research demo to a mainstream consumer technology in less than a single calendar year.
What was the significance of Stable Diffusion being open source?
It was the first time a state-of-the-art image generation model was released with fully open weights, runnable on consumer GPUs and fine-tunable with small datasets. Within months a global ecosystem of fine-tunes, plugins, and applications had emerged that the proprietary players could not match on accessibility. The shift in capability propagation was historically novel.
What is Nano Banana?
Nano Banana is the internal nickname for Google’s Gemini 2.5 Flash Image model, released publicly in August 2025. The nickname came from the development team and got picked up by the wider community before official release. The model’s most distinctive capability is character consistency across multiple generations, alongside fast image editing through natural-language instructions.
What problem did character consistency solve?
Previous text-to-image systems struggled to keep a character’s face, body, and clothing consistent across multiple generations. Generating a sequence of pictures of the same person in different scenes typically produced subtle drift — a different jawline, different eyes, different proportions. Nano Banana and adjacent multimodal models largely solved this at API scale, which unlocked workflows for advertising, design, comics, and editorial illustration that previous systems could not support.
Is open-source image generation falling behind the closed models?
Selectively. The Flux series and adjacent open-weight models remain competitive on raw image quality and offer flexibility the closed systems cannot match. But on multimodal capabilities — conversational editing, character consistency, instruction following across long contexts — the closed-source frontier pulled ahead during 2025, reopening a gap that the original Stable Diffusion release had closed. The gap is meaningful but not decisive.
Are AI-generated images legal?
Mostly yes, with significant grey areas. The output of an AI image generator is typically not copyrightable under current US case law unless there is meaningful human creative input. Training-data legality is contested through multiple ongoing lawsuits. The EU AI Act includes training-data transparency provisions that are now binding for new models. Specific uses — deepfakes of real people, generations in the style of living artists, commercial use of copyrighted characters — have their own evolving legal questions.
How much does it cost to generate an AI image?
In 2026, between roughly a tenth of a cent and a few cents per image at API rack rates from major providers, depending on model, resolution, and conditioning inputs. Per-image cost has dropped by roughly two orders of magnitude since 2022. Per-business cost has grown faster than that, because volume has expanded much more rapidly than unit cost has fallen.
Are there ways to get image-generation API access at a discount?
Yes. The most common is hyperscaler commitment tiers — prepaid annual agreements with AWS, Azure, or GCP that include significantly reduced effective pricing. Secondary marketplaces for unused enterprise credits have also emerged to clear the inefficiency from over-committed annual deals, providing access to image-generation API capacity at meaningful discounts to the rack rate.
What is a LoRA in this context?
Low-Rank Adaptation — an efficient fine-tuning technique that allows large image-generation models to be specialised for particular aesthetics, subjects, or styles using small training sets and modest compute. LoRA underpins much of the open-source ecosystem around Stable Diffusion and Flux, allowing hobbyists and small studios to produce custom variants without retraining the entire base model.
Why are visual artists upset about AI image generation?
Because training datasets for most major image-generation models include vast amounts of copyrighted artwork that was scraped from the web without artist consent, and the resulting models can produce work in identifiable artistic styles. The legal frameworks around this are still being negotiated. The professional and economic harm to working artists is being documented through industry surveys and case studies.
What comes next for AI image generation?
Several trends look likely. Multimodal integration continues, with image generation embedded deeper inside general-purpose foundation models. Video generation, treated as a closely related problem, matures rapidly through 2026 and 2027. Personalisation and character consistency continue to improve, particularly for branded and IP-driven workflows. Cost dynamics shift further toward businesses procuring through cloud commitment tiers and secondary marketplaces. Regulatory frameworks — particularly around training data, watermarking, and deepfake risks — tighten unevenly across jurisdictions.
— Editor’s Note —
On retelling a history while it is still happening.
Any history of a field still in motion is a provisional document. The dates are firm, the systems were real, but the relative significance of the moves between 2022 and now will not be settled for years. The conventional narrative around GANs being “replaced” by diffusion glosses over how much GAN-era research fed into the diffusion era. The conventional narrative around CLIP makes it look more like a planned breakthrough than a happy accident that turned out to enable everything else. The conventional narrative around Stable Diffusion as a populist moment understates how thoroughly the labour and legal questions complicated that populism within eighteen months of release. Where the historical record is contested, we have tried to flag it rather than smooth it over.
MMD Newswire is editorially independent and does not accept compensation from vendors mentioned in our coverage. References to specific platforms, models, or marketplaces reflect our editorial judgement about what serves our readers, not commercial relationships. The framings, interpretations, and structural reads in this article are our own.
