
Right-Clicking on Open Source AI: 9 things I learned trying to 'view source'


I learn by doing, and this was one of those times

In recent CHAOSS AI Alignment Working Group sessions and Open Future meetups, we've been talking about openness, and what it means for AI to be aligned with human intentions, specifically as it relates to honoring the work and intention of open source over the decades. Mulling this over is why I wrote about the 'Open Source AI and the two contracts'; I (like many) am concerned about the widening divide between serving humans and power/profit. I also wrote more recently about the need for open data in the AI definition, which makes so much sense on the surface as an open advocate, but I realized that having an opinion (and even deep experience with openness) isn't enough in this moment.

What seems most lacking is helping average users practically understand AI's output in the context of human values, and giving them the ability to change or challenge what they see. To be actually open. I also want to better understand the resistance to open source + open data in AI, and to clarify for myself which objections are grounded in safety and which are in service of profit.

What I built

My use case focused on license obligations: I built a local app that takes a code snippet, searches open source AI training data for license information, and flags mismatches with the intended license of a project. It runs entirely on open source software, open data, and open models. This is the sort of functionality that (in theory) would help an open source maintainer flag license issues or AI-generated code (in a PR, for example).

The UI: fields for a code snippet, the intended project license, and detected mismatches, plus edit-source, contribute-upstream, and debug panels

What I learned...


It's easy enough to put together an open-source AI stack, but you must clearly define what you mean by 'open'

I got a working app running locally. In addition to my own research and skillset, I used Claude, a closed-source AI, to help me research the stack and write code. I'll share some prompting tips I learned along the way. For example, Claude will instantly reach for Llama and other "open" solutions if you are not specific. Open washing is in the training data, so AI assistants repeat it.

A prompting tip: When asking AI assistants to recommend an open source AI stack, be specific. A prompt like "recommend an open source LLM" will get you Llama (custom restrictive license), Mistral (Apache 2.0 weights but closed training data), and other models that aren't fully open. Try instead: "Recommend an LLM where the model weights, training code, training data, and training recipes are all published under permissive open source licenses (Apache 2.0, MIT, ODC-BY). I need to be able to inspect and search the training data. Do not recommend models where the training data is undisclosed or the license restricts commercial use."

I did spend a fair bit of time reviewing each technology selection before going with it, and iterated on a few models and datasets before choosing:

  • OLMo 2 1B Instruct (Allen AI): open model, open weights and code, Apache 2.0. Not their newest version, but that wasn't important for learning
  • Dolma: the training data the model learned from (3 trillion tokens), also open, searchable via the infini-gram API (ODC-BY)
  • HuggingFace Transformers (Apache 2.0)
  • infini-gram API: (MIT license) a search engine that indexes Dolma for exact phrase matching. This is what my app uses to search the training data. It powers Allen AI's OLMoTrace (Apache 2.0), a more advanced tool that traces model outputs back to training documents verbatim.
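My app's training-data search boils down to a single call against that API. Here is a minimal sketch, assuming the public endpoint and a Dolma index name as they appear in infini-gram's documentation; verify both against the current API reference before relying on them:

```python
# Sketch of an exact-phrase "count" query against the infini-gram API,
# which indexes Dolma. Endpoint and index name are assumptions based on
# infini-gram's public docs; check the current API reference.
import json
import urllib.request

API_URL = "https://api.infini-gram.io/"
INDEX = "v4_dolma-v1_7_llama"  # one of the Dolma indexes

def build_count_query(snippet: str) -> dict:
    """Build the JSON payload: how many times does this exact
    phrase occur in the indexed training data?"""
    return {"index": INDEX, "query_type": "count", "query": snippet}

def count_in_dolma(snippet: str) -> int:
    """POST the query and return the occurrence count (network call)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_count_query(snippet)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

# e.g. count_in_dolma("Permission is hereby granted, free of charge")
```

Exact-phrase matching is exactly as literal as it sounds, which is why the deduplication issues below matter so much.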

The UI was meant to help test 'view source' as a user experience without much effort.

  • Gradio for the UI. I like it because it's easy to test on HuggingFace Spaces, though I didn't publish this time around. Gradio is open source. (Apache 2.0)
  • An editable example bank (JSONL file) that guides licensing classification
  • A "Contribute Upstream" tab for cases of mismatch or incorrect license

Working locally means tradeoffs: smaller model, slower inference. But it ran on my modest laptop with 16GB RAM, and it was more than enough to learn from.

An open dataset can be a curation of other open datasets; this makes tracing source harder

I assumed Dolma was a single dataset. Turns out it's assembled from Common Crawl, The Stack (permissively licensed code from GitHub, which I didn't know existed before this project), C4, Reddit, Wikipedia, peS2o (academic papers), and Project Gutenberg (public domain books). Each has its own licensing terms. "Open" has layers.

Allen AI's contribution is the cleaning pipeline: PII removal, deduplication, quality and toxicity filtering, license checks on code, language detection. They document the entire process openly. This reminded me of what we used to say about normalizing databases and indexing to make queries faster.
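To see why deduplication produces surprisingly low counts, here is a toy version of the idea. Dolma's real pipeline uses far more sophisticated (including fuzzy) matching; this exact-hash sketch only shows the shape:

```python
# Toy illustration of document-level deduplication, the step that
# collapses thousands of copies of popular code down to a handful.
import hashlib

def normalize(doc: str) -> str:
    """Cheap normalization: collapse whitespace so trivially
    reformatted copies hash the same."""
    return " ".join(doc.split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first document seen for each normalized content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "function add(a, b) { return a + b; }",
    "function  add(a, b)  { return a + b; }",  # reformatted copy: dropped
    "function add(x, y) { return x + y; }",    # renamed fork: survives
]
print(len(deduplicate(corpus)))  # prints 2
```

Note the renamed fork survives as a separate document: exact-match dedup (and exact-phrase search) both miss it, which foreshadows the license-gap problem below.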

A prompting tip: When you find an open dataset, don't assume you understand what's in it. Ask: "What datasets were used to create [dataset name]? For each one, tell me who maintains it, what kind of content it contains, what license it uses, and link me to its documentation." Then follow up: "Which of these sub-datasets contain code? Which contain content scraped from the web? Are there any licensing conflicts between them?"

How I discovered this: I expected popular code like React to have thousands of hits in the training data, since it exists in so many repositories. But during data curation, deduplication collapses near-identical copies down to a handful (68). So the count is low, and the few surviving copies may not be the ones that had the license header attached. I discovered this trying to understand why my search results weren't matching my expectations.

Deduplication can separate code from its license, and that gap is hard to detect

Deduplication removed roughly 40% of code files in The Stack, keeping one copy out of many near-identical versions. License detection was only available for 12% of repositories; the rest relied on automated guessing, which the BigCode team acknowledged had errors (source: https://huggingface.co/datasets/bigcode/the-stack). A modified fork with different variable names or reordered functions won't match as an exact phrase, but might carry the same license obligations. And even when code is found, the LICENSE file or copyright header may not be in the same document. So a search might find the code but not the license.

For now, this kind of analysis is a forensic signal, not a verdict.


An editable example bank lets you steer the model at runtime, no retraining needed

This was the part that felt most like "view source" to me. The app has a JSONL file, one labeled example per line, that gets injected into the model's prompt before each classification. I filled mine with known license patterns: MIT headers, Apache headers, GPL markers, signatures from popular libraries.

A prompting tip: When building an example bank for few-shot prompting, ask your AI assistant: "I want to create a JSONL example bank to guide an LLM doing [your task]. For each example, I need a text sample, a label, a source attribution, and a note explaining why this example teaches the model something useful. Give me 10 seed examples that cover the most common cases and the trickiest edge cases. Format as one JSON object per line." Then review every example yourself. The whole point of an example bank is that a human curated it. If you let the AI generate it unchecked, you've just automated your own blind spots.

No weights change. It's transparent, auditable, version-controllable, and instantly reversible. I could see exactly what was guiding the model's decisions, and change it. Coming up with the use case of licenses helped me think more deeply about what I actually want to know about data behind interactions with AI.
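Mechanically, the bank is simple. Here is a minimal sketch of how one can load the JSONL and inject it as few-shot examples; the field names ("text", "label") are my own choice for this sketch, not a standard:

```python
# Sketch of a runtime example bank: one JSON object per line, loaded
# fresh on each classification and injected into the prompt as
# few-shot examples. Field names are illustrative.
import json

def load_examples(path: str) -> list[dict]:
    """Read the JSONL bank: one labeled example per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def build_prompt(examples: list[dict], snippet: str) -> str:
    """Prepend curated examples, then ask for a label on the snippet."""
    shots = "\n".join(
        f"Code: {ex['text']}\nLicense: {ex['label']}" for ex in examples
    )
    return (
        "Classify the license of the final code snippet.\n\n"
        f"{shots}\n\nCode: {snippet}\nLicense:"
    )
```

Edit the file, and the very next classification uses the new examples; nothing is retrained.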

(Some) safety, security fears associated with open source AI are valid

Last year at a Linux Foundation Members Summit, someone angrily declared "view source for AI" a dangerous statement. It felt drastic at the time, but it stuck in my head as something to understand. He had probably read something like this:

"I think the open-source movement has an important role in AI. With a technology that brings so many new capabilities, it's important that no single entity acts as a gatekeeper to the technology's use. However, as things stand today, unsecured AI poses an enormous risk that we are not yet able to contain." - Open-Source AI Is Uniquely Dangerous 2025

Training data may contain real code from real repositories, and sometimes that includes API keys, credentials, or proprietary code that slipped through filtering; as AI itself starts writing code, that chance may only grow. A tool that makes training data searchable makes all of that more findable too. And the example bank that lets me teach the model about license patterns? The same mechanism could teach it to misidentify them, calling GPL code "MIT," or flagging permissive code as restricted. The barrier to doing this is almost zero.

Last week, Anthropic was blacklisted after refusing to remove safeguards against mass surveillance and autonomous weapons. The principle is the same as my little example bank: whoever controls what shapes the model's behavior controls what it considers acceptable. At my scale, that's a text file on my laptop. At national scale - well, you know...

“We need to ensure America has leading open models founded on American values.” (White House AI action plan)

I don't just want to view source. I want to view the prompt that shaped it

Searching training data for exact matches was the easy part. The hard part was understanding what I was looking at. A code snippet in a PR (as a possible use case) might be assembled from multiple training examples: a function signature from one source, error handling from another, variable names from a third. I heard the term "melted code" recently to describe AI output that blends sources until you can't trace any single origin.

My tool can tell you whether a phrase exists in the training data. Whether the model actually used that specific example to generate the code is a much harder question, and still unsolved.

This is the gap between "view source" and data provenance. I can see ingredients. I can't yet see the recipe. The Cyber Resilience Act (CRA) may eventually force investment in tools that close this gap, but right now, we're early.

Contributing upstream is encouraged, but hard to do meaningfully (yet)

My app has a tab that links to Allen AI's repos for filing issues. The idea: if you find a license mismatch or missing attribution, contribute that finding back. Something like: "I searched a PR snippet, found it in Dolma, and here are the specific phrases and documents that matched, but the license context is missing."

Disclaimer: this is just a test project. I'm not recommending using my app to generate meaningful information for these reports; rather, I'm prototyping what contribution might look like for non-developers.

Report Issues to Allen AI: Found a license problem, missing attribution, or data quality issue in Dolma? Because Dolma is fully open, you can file issues directly with the team. Your feedback goes directly to the people who build and maintain the training data.

Where to send it:

  • allenai/dolma — report problematic, incorrectly licensed, or missing data in the Dolma corpus
  • allenai/OLMo — report unexpected model outputs that may trace back to problematic training data
  • allenai/OLMoTrace — feedback on the training data tracing and attribution system

Draft your issue: Target Repository, Issue Type, Issue Title, Description, Code Snippet (optional), Detected License in Dolma Source (optional)
Contribution tab

In practice, I didn't find a path where my individual findings could clearly improve either the model or the dataset. The contribution mechanisms exist (GitHub issues, open training recipes), but meaningful data improvement requires coordinated effort, not one-off reports. That's an ecosystem problem. Still, it's worth thinking about the value exchange with the open projects we use. We would also want to avoid our contributions sounding like AI slop, which is part of what a project like this might help solve.

A prompting tip: Ask: "For [model name], where can I report issues with the training data? Give me the specific GitHub repositories for the model, the training data, the data pipeline, and the post-training/instruction tuning. For each one, tell me what kind of issues are appropriate to file there, and link me to any CONTRIBUTING.md or issue templates." Then follow up: "What format should a good training data issue take? What information would the maintainers need from me to act on it?"

The cognitive load means agents, and agents mean thinking about people first

"AI-assisted development inverts this relationship. A junior engineer can now generate code faster than a senior engineer can critically audit it." - Cognitive Debt, When Velocity Exceeds Comprehension

Everything I did in this project was manual: paste a snippet, read the results, decide whether to flag it. That's fine for learning, but it doesn't scale; my brain, while good at multitasking, cannot scale either. My next experiment is building an agent that does this continuously: watches incoming code, checks it against open training data, flags license issues, and drafts upstream contributions, with a human governing the process.
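To make that concrete, here is a hypothetical shape for such an agent loop. Every function is a stub standing in for a real integration (PR watcher, training-data search), and a human approval callback gates anything that leaves the loop:

```python
# Hypothetical agent loop with a human-in-the-loop gate. All of the
# names and stub behaviors here are illustrative, not a real system.

def watch_incoming_code():
    """Stub: yield code snippets from, e.g., new PRs."""
    yield "def add(a, b): return a + b"

def check_against_training_data(snippet: str) -> dict:
    """Stub: search open training data and detect license context."""
    return {"found": True, "detected_license": "GPL-3.0"}

def run_agent(intended_license: str, approve) -> list[dict]:
    """Flag mismatches; only draft an upstream report when a human
    (the `approve` callback) signs off on each finding."""
    reports = []
    for snippet in watch_incoming_code():
        result = check_against_training_data(snippet)
        if result["found"] and result["detected_license"] != intended_license:
            draft = {"snippet": snippet, **result}
            if approve(draft):  # a human governs the loop
                reports.append(draft)
    return reports

# An auto-denying approver: nothing gets filed without a real human.
print(len(run_agent("MIT", approve=lambda draft: False)))  # prints 0
```

The design point is the `approve` callback: the agent can watch and draft at machine speed, but the decision to contribute stays with a person.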

But agents raise their own questions. Who reviews the agent's decisions? Who edits its example bank? What permissions do we allow? We need to think about people in systems design like this. Automation without governance just moves the trust problem somewhere harder to see.

View source but faster.

Governance is the real infrastructure we need

Creative Commons published a piece this week called "AI's Infrastructure Era" that captures the work ahead. They argue that governance needs to move from principles to infrastructure, and that openness and guardrails aren't opposites. Responsible governance is what makes open systems sustainable.

We believe that the path forward is not enclosure. It is stewardship. Governance mechanisms, interoperability standards, and access frameworks will determine who participates in the AI ecosystem and who does not. If we want AI systems that reflect diverse knowledge and lived realities, we must build the infrastructure that makes responsible openness durable. - Creative Commons, AI's Infrastructure Era

Last word...

More people need a seat at the table. There is no shortage of organizations making bold and inspiring claims about open source AI and the work to be done, but community influence on AI governance remains very hard to discover; everything is aimed at developer eyes, when we also need to invite educators, scientists, librarians, and students: USERS.

The early web's transparency came from millions of people right-clicking "View Source" and learning how things worked. AI needs that same democratization of understanding before we can have meaningful democratic governance. Building this project was my attempt at right-clicking. My early sense, and it is early, is that the absence of open data in AI is less about deliberate secrecy (although that is certainly true in some cases) and more about unknown risks, convenience, and cognitive load.

Hopefully sharing this is valuable to folks thinking about how to invite participation and teach openness in this new era.


I have been underemployed since my layoff from Microsoft last year, and I'm open to contracts exploring these and other topics. Please reach out if you think we should work together!

Licensed under CC BY-SA 4.0