Copyright law makes a case for requiring data information rather than open datasets for Open Source AI

The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contribution from a significant and broad group of stakeholders is imperative to the Open Source process and is proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.

This series features the voices of the volunteers who have helped shape and are shaping the Definition.

Meet Felix Reda

Photo credit: Volker Conradus (volkerconradus.com), CC BY 4.0 International.

Felix Reda (he/they) has been an active contributor to the Open Source AI Definition (OSAID) co-design process, bringing his personal interest and expertise in copyright reform to the online forums. Felix has worked in digital policy for over ten years, including as a member of the European Parliament from 2014 to 2019 and with the strategic litigation NGO Gesellschaft für Freiheitsrechte (GFF), and is currently the director of developer policy at GitHub. He is also an affiliate of the Berkman Klein Center for Internet and Society at Harvard and serves on the board of the Open Knowledge Foundation Germany. He holds an M.A. in political science and communications science from the University of Mainz, Germany.

Data information as a viable alternative

Note: The original text was contributed by Felix Reda to the discussions happening on the Open Source AI forum as a response to Stefano Maffulli’s post on how the draft Open Source AI Definition arrived at its current state, the design principles behind the data information concept and the constraints (legal and technical) it operates under.

When we look at applying Open Source principles to the subject of AI, copyright law comes into play, especially for the topic of training data access. Open datasets have been a continuous discussion point in the collaborative process of writing the Open Source AI Definition. I would like to explain why the concept of data information is a viable alternative for the purposes of the OSAID.

The definition of Open Source software has an access element and a legal element – the access element being the availability of the source code and the legal element being a license rooted in the copyright-protection given to software. The underlying assumption is that the entity making software available as Open Source is the rights holder in the software and is therefore entitled to make the source code available without infringing the copyright of a third party, and to license it for re-use. To the extent that third-party copyright-protected material is incorporated into the Open Source software, it must itself be released under a compatible Open Source license that also allows the redistribution.

When it comes to AI, the situation is fundamentally different: The assumption that an Open Source AI model will only be trained on copyright-protected material that the developer is entitled to redistribute does not hold. Different copyright regimes around the world, including the EU, Japan and Singapore, have statutory exceptions that explicitly allow text and data mining for the purposes of AI training. The EU text and data mining exceptions, which I know best, were introduced with the objective of facilitating the development of AI and other automated analytical techniques. However, they only allow the reproduction of copyright-protected works (aka copying), but not the making available of those works (aka posting them on the internet).

That means that an Open Source AI definition requiring the republication of the complete dataset in order for an AI model to qualify as Open Source would categorically exclude Open Source AI models from the ability to rely on the text and data mining exceptions in copyright. That is despite the fact that the legislator explicitly decided that, under certain conditions (for example, allowing rights holders to declare a machine-readable opt-out from training outside of the context of scientific research), the use of copyright-protected material for the purposes of training AI models should be legal. This result would be particularly counterproductive because it would even render Open Source AI models illegal in situations where the reproducibility of the dataset would be complete by the standards discussed on the OSAID forum.

Examples

Imagine an AI model that was trained on publicly accessible, version-controlled text on the internet, for which the rights holder had not declared an opt-out, but which the rights holder had also not put under a permissive license (all rights reserved). Using this text as training data for an AI model would be legal under copyright law, but re-publishing the training dataset would be illegal. Publishing information about the training dataset would be a different matter. If that information included the version of the data that was used, when and how it was retrieved and from which website, and how it was tokenized, it would meet the requirements of the OSAID v. 0.0.8 if (and only if) it put a skilled person in the position to build their own dataset to recreate an equivalent system.

Neither the developer of the original Open Source AI model nor the skilled person recreating it would violate copyright law in the process, unlike the scenario that required publication of the dataset. Including a requirement in the OSAID to publish the data, in which the AI developer typically does not hold the copyright, would have little added benefit but would drastically reduce the material that could be used for training, despite the existence of explicit legal permissions to use that content for AI training. I don’t think that would be wise.
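To make the example above more concrete, here is a minimal sketch of what a single entry of such data information might look like. It is an illustration only: the field names, the `SourceRecord` class, the `to_manifest` helper and the example values are hypothetical and are not prescribed by the OSAID or any draft of it. The point it shows is that the record describes where and how the text was obtained and processed without containing the text itself.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SourceRecord:
    """One hypothetical 'data information' entry for a single training source."""
    url: str               # website the text was retrieved from
    version: str           # version identifier of the text that was used
    retrieved_at: str      # when it was retrieved (ISO 8601 timestamp)
    retrieval_method: str  # how it was retrieved
    tokenizer: str         # how it was tokenized


def to_manifest(records):
    """Serialize records to JSON; the manifest describes the works, it does not copy them."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


# Placeholder values, assumed for illustration only.
example = SourceRecord(
    url="https://example.org/docs/page",
    version="rev-abc123",
    retrieved_at="2024-05-01T12:00:00Z",
    retrieval_method="HTTP GET of the public page",
    tokenizer="byte-pair encoding, 32k vocabulary",
)
manifest = to_manifest([example])
```

Whether a manifest like this is sufficient still depends on the test described above: it has to put a skilled person in the position to build their own dataset and recreate an equivalent system.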

The international concern of public domain

While I support the creation of public domain datasets that can be republished without restrictions, I would like to caution against pointing to these efforts as a solution to the problem of copyright in training datasets. Public domain status is not harmonized internationally: what is in the public domain in one jurisdiction is routinely protected by copyright in other parts of the world. For example, in US discourse it is often assumed that works generated by US government employees are in the public domain. In fact, they are in the public domain only in the US and remain copyright-protected in other jurisdictions.

The same goes for works in which copyright has expired: Although the Berne Convention allows signatory countries to end the copyright term on a work once protection in the work’s country of origin has expired, exceptions to this rule are permitted. For example, although the first incarnation of Mickey Mouse has recently entered the public domain in the US, it is still protected by copyright in Germany due to an obscure bilateral copyright treaty between the US and Germany from 1892. Copyright protection is not conditional on registration of a work, and no even remotely comprehensive, reliable rights information on the copyright status of works exists. Good luck to an Open Source AI developer who tries to stay on top of all of these legal pitfalls.

Bottom line

There are solid legal permissions for using copyright-protected works for AI training (reproductions). There are no equivalent legal permissions for incorporating copyright-protected works into publishable datasets (making available). What an Open Source AI developer thinks is in the public domain and therefore publishable in an open dataset regularly turns out to be copyright-protected after all, at least in some jurisdictions. 

Unlike reproductions, which only need to follow the copyright law of the country in which the reproduction takes place, making content available online needs to be legal in all jurisdictions from which the content can be accessed. If the OSAID required the publication of the dataset, this would routinely lead to situations where Open Source AI models could not be made accessible across national borders, thus impeding their collaborative improvement, one of the great strengths of Open Source. I doubt that with such a restrictive definition, Open Source AI would gain any practical significance. Tragically, the text and data mining exceptions, which were designed to facilitate research collaboration and innovation across borders, would then only support proprietary AI models while excluding Open Source AI. The concept of data information will help us avoid that pitfall while staying true to Open Source principles.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.