Open source maintainers need an answer to AI clean rooms
As of right now, AI tools make all LICENSE files effectively worthless. If you run an open source project, any license you put on that project can be easily bypassed. Consider adopting the Ship of Theseus license to try to patch the hole.
The way open source licensing works is pretty straightforward. By default, any code you write is yours, you have a copyright on it. No one is allowed to use it. This applies to million line codebases, and it applies to the smallest of code snippets on Stack Overflow.1 In order for someone to use code without running afoul of the law, the author of the code has to give explicit permission detailing what can be used and how. These permissions are often laid out in a LICENSE document.
Behind our current, relatively straightforward understanding of open source licensing lies ~45 years of court battles and lawfare, because trillions of dollars are at stake. That’s not a typo. Take Linux as an example. Your phone runs Linux. The web server your phone talks to runs Linux. All of the machines that pass messages between your phone and that server run Linux. Your blender used to run Java, but now it runs Linux. Linux is the most valuable individual piece of software in the world. It’s worth $0, you can download a copy right now, for free. So the questions of who owns Linux and how it can be used are really, really important.
One of the most important license innovations was the invention of ‘copyleft’. The basic idea: anything that is built with a copyleft license has to also use a copyleft license. Even though Linux is open source, and even though you can download a copy of it, you cannot package it up in a closed source box and try to sell it. You are legally required to release the source code for anything that derives from Linux.
It is shocking how much of the modern tech stack depends on copyleft licenses. Most compilers are copyleft. The GNU Binutils are copyleft. Git is copyleft. So is Bash. MySQL, VLC, coreutils, glibc, ffmpeg. More recently, WordPress, MongoDB, Elasticsearch. These are foundational pillars of the global tech stack. Thousands of companies have spent tens of millions of dollars of employee time on these open source projects, because the companies depend on them somewhere in their stack. Many of those companies would have preferred their employee time going to closed source software that they could then redistribute at a profit. Copyleft licensing prevents that. And many projects — like MySQL, which uses a copyleft license for general use but offers a paid option for companies who want to get rid of the copyleft requirements — could only get funded through the existence of copyleft licensing.
As you might imagine, many people have spent a lot of time thinking about how to get around these licenses. Code is and always has been in a weird gray area when it comes to intellectual property. You can’t copyright math. You can’t copyright an idea. But you can copyright code. So what, exactly, are you protecting? The courts say that you are protecting a specific expression of an idea. If someone copied the Linux kernel, or if someone wrote their own kernel while looking at Linux, all of that new code is based on the previous implementation. So it’s all protected by the Linux license. But if someone just, like, read about the Linux kernel, and got a really good understanding of how it behaves, and then made their own version of the kernel, that would be a new implementation and would be totally fair game.
The technical jargon for this is a “clean room implementation.” Team A spends time poring over the code and writing an extremely detailed specification without writing any code. And then Team B, which has never looked at the original source, writes new code to meet the specification. Team A and Team B don’t interact at all otherwise, to ensure the final output is ‘clean’.
Traditionally this kind of license circumvention is extremely costly. It requires a lot of time and at least two (teams of) people.
AI makes this trivial.
You have a session of Claude looking at the original code base and writing a spec. And then a different session of Claude looks at the spec and writes new code. The (untested) legal theory is that this is sufficient to remove the license, because the new code is “clean”.
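The shape of that two-session pipeline can be sketched in a few lines of Python. Everything here is hypothetical illustration, not a real SDK: the `ask` callables stand in for whatever model API you use, and the stub "models" at the bottom exist only so the sketch runs. The one structural point is that the only artifact crossing the wall between the two sessions is the spec.

```python
# Hypothetical sketch of an AI "clean room": two isolated sessions,
# connected only by a specification. The ask() callables stand in for
# whatever model API you use; nothing here is a real SDK call.

def write_spec(ask, original_source: str) -> str:
    """Session A: reads the licensed code, emits a behavioral spec (no code)."""
    return ask(
        "Describe the behavior of this code as a detailed specification. "
        "Do not include any code.\n\n" + original_source
    )

def reimplement(ask, spec: str) -> str:
    """Session B: never sees the original source, only the spec."""
    return ask(
        "Write a fresh implementation that satisfies this specification:\n\n"
        + spec
    )

def clean_room(session_a, session_b, original_source: str) -> str:
    spec = write_spec(session_a, original_source)
    # The only thing that crosses the wall between sessions is the spec.
    return reimplement(session_b, spec)

if __name__ == "__main__":
    # Stub "models" so the sketch runs without any API access.
    spec_bot = lambda prompt: "SPEC: succ(n) returns n + 1 for any integer n"
    code_bot = lambda prompt: "def succ(n):\n    return n + 1\n"
    print(clean_room(spec_bot, code_bot, "def succ(n): return n+1"))
```

The (untested) legal theory rests entirely on that separation: session B's output is supposedly "clean" because no original code ever reached it, only the description of its behavior.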
People are already using this strategy to remove licenses.
In the world of open source, relicensing is notoriously difficult. It usually requires the unanimous consent of every person who has ever contributed a line of code, a feat nearly impossible for legacy projects. chardet, a Python character encoding detector used by requests and many others, has sat in that tension for years: as a port of Mozilla’s C++ code it was bound to the LGPL, making it a gray area for corporate users and a headache for its most famous consumer.
Recently the maintainers used Claude Code to rewrite the whole codebase and release v7.0.0, relicensing from LGPL to MIT in the process. The original author, a2mark, saw this as a potential GPL violation…
Context: The viral GitHub fork of the leaked Claude Code was at immediate risk of a DMCA takedown (Anthropic had killed prior mirrors in minutes), so its maintainer — worried about getting sued — used OpenAI’s Codex to rewrite the entire ~512k-line TypeScript codebase from scratch into Python overnight as a “clean-room” reimplementation.
This preserved the full agent harness, tools, and behavior without copying a single original line, instantly turning a copyright landmine into the safe, exploding open-source version everyone is now starring.
Finally, liberation from open source license obligations.
Our proprietary AI robots independently recreate any open source project from scratch. The result? Legally distinct code with corporate-friendly licensing. No attribution. No copyleft. No problems.
…
Our proprietary AI systems have never seen the original source code. They independently analyze documentation, API specifications, and public interfaces to recreate functionally equivalent software from scratch.
The result is legally distinct code that you own outright. No derivative works. No license inheritance. No obligations.
I’m not sure if the last one is satirical — it is literally named ‘evil corp’ — but according to Reddit:
Clearly meant to be satire, with the name of the company basically being “EvilCorp” and the fake user quotes from names like “Chad Stockholder”, but it does actually accept payment and seemingly does what it describes, so it’s certainly a bit beyond just a joke at this point. A livestreamer recently tried it with some simple JavaScript libraries and it worked as described.
A virtual bash environment with an in-memory filesystem, written in TypeScript and designed for AI agents.
Broad support for standard unix commands and bash syntax with optional curl, Python, JS/TS, and sqlite support.
after which Cloudflare rebuilt Next.js
Last week, one engineer and an AI model rebuilt the most popular front-end framework from scratch. The result, vinext (pronounced “vee-next”), is a drop-in replacement for Next.js, built on Vite, that deploys to Cloudflare Workers with a single command. In early benchmarks, it builds production apps up to 4x faster and produces client bundles up to 57% smaller. And we already have customers running it in production.
which Vercel then got mad about!
Open core here means having an OSS project with licenses like AGPL or BSL that allow anyone to derive from the software but only the original author to provide it as a multi-tenant platform.
This is very different from Cloudflare slop forking next.js. They made a choice to slop fork, but they could have just pressed the trad-fork button on GitHub since next.js is MIT licensed.
The licenses "protecting" open core software assume that making software is hard, but they don't protect from a slop fork which reproduces the behavior without directly deriving from the license-encumbered implementation.
What I'm not sure about is whether this means less open source or more liberal licenses, as folks realize that they might as well put it out there now that everybody can copy it anyway.
I cannot stress enough how much this is a fully untested legal theory. In the background of all of the above, there has been an ongoing legal fight over whether AI generated content can be copyrighted at all. And so far, the answer is no!
From CNBC:
The U.S. Supreme Court declined on Monday to take up the issue of whether art generated by artificial intelligence can be copyrighted under U.S. law, turning away a case involving a computer scientist from Missouri who was denied a copyright for a piece of visual art made by his AI system.
Plaintiff Stephen Thaler had appealed to the justices after lower courts upheld a U.S. Copyright Office decision that the AI-crafted visual art at issue in the case was ineligible for copyright protection because it did not have a human creator.
Thaler, of St. Charles, Missouri, applied for a federal copyright registration in 2018 covering “A Recent Entrance to Paradise,” visual art he said his AI technology “DABUS” created. The image shows train tracks entering a portal, surrounded by what appears to be green and purple plant imagery.
The Copyright Office rejected his application in 2022, finding that creative works must have human authors to be eligible to receive a copyright. U.S. President Donald Trump’s administration had urged the Supreme Court not to hear Thaler’s appeal.
The Copyright Office has separately rejected bids by artists for copyrights on images generated by the AI system Midjourney. Those artists argued that they were entitled to copyrights for images they created with AI assistance - unlike Thaler, who said his system created “A Recent Entrance to Paradise” independently.
A federal judge in Washington upheld the office’s decision in Thaler’s case in 2023, writing that human authorship is a “bedrock requirement of copyright.”
If the most extreme version of that line of reasoning applies to code,2 then all of the code written by AI may very well be uncopyrightable, i.e. it acts as if it were in the public domain.3 But even under less extreme interpretations, the actual litigation will depend on whether or not the AI did substantive expressive work. And for a straightforward overnight clean-room implementation, where the AI does literally all of the analysis while the user is sleeping, it is entirely possible that the AI’s output is a derivative work and carries the corresponding obligations of the input license.
Most open source maintainers are not about to go to court to litigate these license infringement cases — this is, in part, why all of this is still a grey area to begin with.4 But clarification of intent goes a long way. It is much easier to eventually defend copyright in a court of law if you are clear from the beginning about how that code ought to be used. And law is not code. Social weight matters. There is a huge difference between a lawyer going “this is a grey area but it’s probably fine” and “this is a grey area so I wouldn’t risk it.”
With all that in mind, we introduced the Ship of Theseus license to all of our open source codebases. This license aims to plug the AI clean room hole. It is a very simple license, with only two clauses:
SHIP OF THESEUS LICENSE v0.1
* Using any AI tool to produce functionally equivalent software — by
referencing this code, its documentation, its behavior, or any
specification, description, or abstraction derived from the
foregoing — creates a derivative work subject to the full terms of
the primary license, regardless of whether the output shares any
literal code with this project.
* Any derivative work must include this license alongside the
  primary license.

On its own, the Ship of Theseus license does not grant any claims or enforce any limitations. Rather, it makes clear that any AI-derived work is exactly that: derived work.
We still haven’t fully nailed down whether AI tools themselves, which have almost certainly been trained on all open source material already, can even count as being ‘clean’ in any sense. But I’m not taking any chances. I’d rather have some explicit indication of my legal intent than to throw up my hands and assume any open source licensing is dead.
In order to get clarity, this sort of approach requires wider adoption, so if any of this resonates, I strongly encourage other open source maintainers to adopt this license or a similar one.
Note: after sharing the original Ship of Theseus license around, a friend linked me to Armin Ronacher’s blog from about a month prior, where he independently came up with a similar analogy. The Ship of Theseus license wasn’t inspired by Ronacher, but the convergent evolution of the name hopefully means that its intent is intuitive to understand from the name alone.
Google didn’t allow us to use Stack Overflow for this reason.
Thaler’s case was one where the human disclaimed any creative role. I assume most code is not going to be exactly like that. But ‘dumb’ overnight rewrites might very well be!
For legal purposes, public domain is a different thing than uncopyrightable, but for downstream users these are basically identical.
Unsurprisingly, I think the closest we got to clarity on these questions was the Google v. Oracle SCOTUS case. That was a bruising legal fight between two tech heavyweights who spent over a decade and tens to hundreds of millions of dollars, before getting into how much was spent on marketing. The core question there was whether Google was violating Oracle’s copyright by doing a (non-AI) clean room implementation of Oracle’s API. The end result bypassed the copyright question entirely. SCOTUS deemed Google’s usage of the API as “transformative”, therefore falling within fair use. The federal circuit court ruling that said the APIs were copyrightable is still on the books, but unaddressed by SCOTUS.

