Fair use? Who owns our data?

There is a growing conversation about data rights, suggesting that the debate over digital ownership is just beginning. Apart from several well-publicised legal cases in the USA over scraped data, in the UK the government has been forced to launch an urgent review after facing a fierce campaign by musicians and others working in the creative industries. The big tech AI companies have been claiming that scraping data for training Large Language Models represents fair use, even when explicitly asked not to do so.
Now Mastodon, the social network launched in 2016 that allows users to run their own servers while connecting through a federated model, offering an alternative to mainstream platforms, has updated its terms of service to explicitly prohibit the use of its content for training AI models.
The new terms of service say:
- We explicitly prohibit the scraping of user data for unauthorized purposes, e.g. archival or large language model (LLM) training. We want to make it clear that training LLMs on the data of Mastodon users on our instances is not permitted.
- We clarify that you maintain rights over the works you upload to our service (i.e. your posts, images, audio, video etc. - your content), but grant us a license necessary for the functioning of the service, e.g. displaying your works to your followers.
- We’re setting a new minimum age requirement of 16 years. Previously this was 13 for users in the US specifically; now it is 16 everywhere.
- We’re adding information about the DMCA and other legal notices, as well as disputes and arbitration; along with the prohibited uses section, which underlies our server rules.
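Terms of service alone cannot stop automated scrapers, so server operators often pair them with machine-readable signals. As a purely illustrative sketch (not part of Mastodon's announcement), a server's robots.txt file could block the publicly documented crawlers associated with AI training data collection:

```text
# Illustrative robots.txt: disallow crawlers used to gather AI training data.
# These user-agent names are publicly documented by OpenAI (GPTBot),
# Google (Google-Extended) and Common Crawl (CCBot).

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Compliance with robots.txt is voluntary on the crawler's part, which is precisely why the fair use question above matters.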
And Creative Commons have called for a new pact for reciprocity in the age of AI. They say "We believe reciprocity must be embedded in the AI ecosystem in order to uphold the social contract behind sharing. If you benefit from the commons, and (critically) if you are in a position to give back to the commons, you should. Because the commons are for everyone, which means we all need to uphold the value of the commons by contributing in whatever way is appropriate."
A recent invitation to an online event on June 25 says:
We invite you to join us for the CC Signals Kickoff Event!
We’re launching the first iteration of a preference signals framework, which we are provisionally calling CC signals. CC signals are designed to offer a new way for stewards of large collections of content to indicate their preferences as to how machines (and the humans controlling them) should contribute back to the commons when they benefit from using the content.
We are kicking off the first phase of this project by inviting public feedback on a paper prototype. Your engagement while we collectively collaborate on a tool that infuses reciprocity into the AI ecosystem and protects a thriving creative commons in the age of AI is paramount.
This debate isn't going to go away!
About the Image
This image represents the interest of technology companies (including but not limited to AI companies) in the data produced by students. Young students at computers with retinal scanners on their screens suggest the capture not only of data entry but also of biometric data. The pixelation, binary code, and the data-wave heatmap at the top suggest the ways that student work - and bodies - are abstracted by the datafication process. Design created using public domain images and effects in Rawpixel.