Good Models Borrow, Great Models Steal: Intellectual Property Rights and Generative AI

By Simon Chesterman

Two critical policy questions will determine the impact of generative artificial intelligence (AI) on the knowledge economy and the creative sector. The first concerns how we think about the training of such models — in particular, whether the creators or owners of the data that are ‘scraped’ (lawfully or unlawfully, with or without permission) should be compensated for that use.

The second question revolves around the ownership of the output generated by AI, which is continually improving in quality and scale. These questions are inherently linked to the realm of intellectual property: a legal framework designed to incentivise and reward specifically human creativity and innovation. For some years, however, the United Kingdom has maintained a distinct category with limited rights for ‘computer-generated’ outputs; on the input issue, the European Union and Singapore have recently introduced exceptions allowing for text and data mining (TDM) or computational data analysis of existing works.

My working paper, in preparation for a special issue of Policy & Society on ‘Governance of Generative Artificial Intelligence’, explores the broader implications of these policy choices, weighing the advantages of reducing the cost of content creation against the potential erosion of the value of expertise and the risk that various careers and sectors of the economy are rendered unsustainable.

Lessons may be found in the music industry, which also went through a period of unrestrained piracy in the early digital era, epitomised by the rise and fall of the file-sharing service Napster. Similar litigation and legislation may help navigate the present uncertainty, along with an emerging market for ‘legitimate’ models that respect the copyright of humans and are clear about the provenance of their own creations.

Is Scraping Fair Use?

AI has always depended on access to data. Large language models (LLMs), in particular, are trained on huge datasets comprising publicly available material as well as copyrighted and pirated material found online. How (if at all) should the rights of the creators whose text and images train such models be recognised and compensated? The use of pirated or illegally obtained material appears at first blush to be a simple case of theft of intellectual property, but such theft has proven notoriously difficult to establish. Around the world, concepts like fair use are being stretched by the wholesale consumption of books, photographs, and other materials, and litigation has followed.

Singapore is an example of a jurisdiction that has tried to thread this needle through legislation. Amendments to its copyright legislation in 2021 include a permitted use to make a copy of a work for the purpose of ‘computational data analysis’, which includes extracting and analysing information and using it to ‘improve the functioning of a computer program in relation to that type of information or data’. The provision still requires lawful access to the underlying data, but appears more open to data mining and model training than traditional conceptions of fair use, the ‘non-commercial’ text and data analysis exception adopted in the United Kingdom in 2014, or the ‘text and data mining’ (TDM) exception adopted by the European Union in 2019.

An information sheet produced by the Intellectual Property Office of Singapore (IPOS) explicitly states that the provision is intended to allow ‘training machine learning’. Yet analysing text or images for the purpose of making recommendations or optimising workflows is quite distinct from using those texts and images to generate more text and images. The difference lies not only in the use itself, in which copying is central to the process, but also in the economic impact of that use.

The music industry may offer some lessons about where we go from here. It also went through a period of unrestrained piracy in the early digital era, which radically transformed the economics of copying and gave rise to file-sharing services such as Napster. Lawsuits and legislative changes led most media platforms to adopt copyright policies and takedown protocols, while services like Napster were shut down entirely.

It is possible that a similar evolution will take place in generative AI. Adobe, for example, has built its Firefly tools using training sets consisting only of public domain and licensed works. Shutterstock has also committed to building AI tools with a Contributor Fund to compensate artists.

Other approaches are possible, such as YouTube’s practice of allowing certain uses of music and other copyrighted material while sharing advertising revenue with the owners of the original work through its Content ID system. Another option is to allow content creators to ‘opt out’ of having their content scraped, either through directives in a site’s robots.txt file or by registering its internet protocol (IP) address.
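To illustrate the first of these mechanisms, an opt-out might take the form of entries like the following in a site’s robots.txt file. The agent names shown are ones their operators have published (GPTBot for OpenAI’s crawler, CCBot for Common Crawl); which crawlers honour such directives varies by provider:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

It bears noting that robots.txt is a convention rather than a technical barrier: it binds only those crawlers that choose to respect it, which is one reason harder forms of opt-out, such as IP-level blocking, are also being explored.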

Who Owns the Output?

In most jurisdictions, automatically generated text does not receive copyright protection. The U.S. Copyright Office has stated that legislative protection of ‘original works of authorship’ is limited to works ‘created by a human being’. It will not register works ‘produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author’ (emphasis added). The word ‘any’ is key, raising the question of what level of human involvement is required to assert authorship.

An alternative approach, adopted in Britain, is to offer more limited protections for ‘computer-generated’ works, the ‘author’ of which is deemed to be the person who undertook ‘the arrangements necessary for the creation of the work’. ‘Computer-generated’ is defined to mean that the work was ‘generated by computer in circumstances such that there is no human author of the work’. Similar legislation has been adopted in New Zealand, India, Hong Kong, and Ireland. Though disputes may arise over who undertook the ‘arrangements necessary’, the only possible outcomes are ownership by a recognised legal person or by no one at all. The term of protection is generally shorter, and the deemed ‘author’ cannot assert moral rights, including the right to be identified as the author of the work.

A World Intellectual Property Organization (WIPO) issues paper recognised the dilemma, noting that excluding these works would favour ‘the dignity of human creativity over machine creativity’ at the expense of making the largest number of creative works available to consumers. A middle path, it observed, was to offer ‘a reduced term of protection and other limitations’. Several commentators have suggested similar approaches.

As human authorship becomes more ambiguous, that middle ground may help preserve and reward flesh-and-blood authorship, while also encouraging experiments in collaboration with our silicon-and-metal partners.

Good Authors Borrow?

T.S. Eliot is often credited with the observation that ‘good authors borrow, great authors steal’; his actual formulation, in The Sacred Wood, was that ‘immature poets imitate; mature poets steal’.

Eliot was not, of course, condoning plagiarism. His larger point was to challenge a naïve idealisation of the creative process: in the arts, as much as in the sciences, each new thinker and writer builds on the work of those who have come before. Painters inspire and echo one another; writers offer variations on plots and structures that can be mapped and catalogued.

This is clearest in music, where the limits of the heptatonic scale and chord progressions mean that melodies will inevitably echo one another, as Ed Sheeran successfully argued in a case concerning similarities between his hit song ‘Thinking Out Loud’ and Marvin Gaye’s ‘Let’s Get It On’.

It may seem pointless to argue that AI developers should pay for the use of data when the entire Internet has already been absorbed. In addition to the emerging market for ‘legitimate’ models, however, there is evidence that the further refinement of those models and the training of new ones depend not just on the volume of data but also on its quality.

Early suggestions that LLMs might continue improving by training on synthetic data that they themselves create have foundered on projections that such AI-generated data will ‘poison’ future models. Assuming that there is an ongoing market for data and the political will to regulate it, the idea that generative AI will have its own ‘Napster moment’ is at least plausible.

An op-ed drawing on some of the material in this post was published in the Straits Times on 24 October 2023.

Keywords: Artificial Intelligence, Generative AI, Large Language Models, ChatGPT, GPT-4, Copyright, Intellectual Property

AUTHOR INFORMATION

Professor Simon Chesterman is David Marshall Professor and Vice Provost (Educational Innovation) at the National University of Singapore, where he is also the founding Dean of NUS College. He serves as Senior Director of AI Governance at AI Singapore and Editor of the Asian Journal of International Law.

Email: chesterman@nus.edu.sg
