Remember Cat Bin Lady? Back in 2010 Mary Bale, AKA Cat Bin Lady, was briefly public enemy number one when CCTV of her putting a cat in a bin went super-viral. There was a single-topic parody Twitter account (remember those?), fake social accounts purporting to be her, and the following actual quote from a Facebook spokesperson in a newspaper (remember those?) article: “We can’t comment on individual cases for privacy reasons but I can tell you that one group, entitled Death to Mary Bale, has been removed today.” It was all very 2010 social media.
There’s still a live public fake Mary Bale Facebook account, set up at the time to troll cat lovers with posts like “can i borrow ene 1s bin? mines full”. Now 13 years later, some people who friended the account, presumably to send abuse, have forgotten who they added and each year reflexively wish happy birthday to a fake version of a cat jailer (see image above).
Imagine a large language model learning from this version of the Internet. An LLM ingesting the entire works of Shakespeare, then undoing all the good work by gorging on Boomer Facebook. Which is a roundabout way of getting to datasets, the lifeblood of large language models.
Datasets are used to train LLMs, and the better the data, the better the training. There’s one dataset in particular that’s of interest to people focused on AI and what it means for creatives, and that is Books3. Books3 is a dataset of approximately 170,000 written works, about one-third fiction, that has been available to AI developers since October 2020. Books3 has been mentioned in lawsuits filed by authors Sarah Silverman, Christopher Golden and Richard Kadrey against both OpenAI and Meta, and a recent report in The Atlantic identified more authors, including James Patterson, Stephen King and Zadie Smith, whose work is included in the dataset.
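For a rough sense of what “checking whether your book is in the dataset” involves, here’s a minimal sketch. Everything in it is hypothetical: it assumes a local copy of a Books3-style dump stored as a folder of plain-text files with author names in the filenames, which is not how the real distribution is laid out.

```python
from pathlib import Path

# Hypothetical layout for a local Books3-style dump: one .txt file per book,
# with author names in the filenames. The real dataset is organized
# differently; this only illustrates the kind of check authors have been doing.
DATASET_DIR = Path("books3_texts")

def books_by_author(author: str) -> list[str]:
    """Return filenames in the dump that mention the given author."""
    if not DATASET_DIR.is_dir():
        return []
    needle = author.lower()
    return [p.name for p in DATASET_DIR.glob("*.txt") if needle in p.name.lower()]

for name in ("Stephen King", "Zadie Smith"):
    print(f"{name}: {len(books_by_author(name))} file(s) found")
```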
Why is this important? Because it is one of a number of cases we will see over the coming years that test the boundaries of copyright law in relation to AI training. Getty Images claims Stability AI scraped copyright-protected images for its Stable Diffusion tool; three artists have brought a lawsuit against Stability AI, Midjourney and DeviantArt; essentially, anyone who suspects their creative work has been used, without permission, to train a model could be tempted to pursue it through the courts.
And it’s not simply a case of poor artists versus big corporations. Many corporations would stand to profit from AI companies being forced to license the content used in training models. And the Books3 dataset itself was created by an independent developer who believes open-source datasets allow individuals to compete with giants like OpenAI and Meta.
For generative AI to work as well as it could, it needs to be trained on our great, and good, and fine, and terrible, works of art across all mediums. The emerging generation of AI artists will be standing on the shoulders of giants. It remains to be seen if those giants will be paid. And should they need to be paid, will that mean free generative AI equals Boomer Facebook-quality generative AI?
Trust the process
There’s an academic maxim that Wikipedia can be the first site you visit when researching, but it should never be the last. Associated Press (AP) has issued similar guidance for staff use of ChatGPT, with which it has a licensing agreement. “While AP staff may experiment with ChatGPT with caution, they do not use it to create publishable content. Any output from a generative AI tool should be treated as unvetted source material. AP staff must apply their editorial judgment and AP’s sourcing standards when considering any information for publication.”
The guidance (full quotes published here) could work well for many companies outside of news and publishing, and is mostly common sense about not abandoning human experience and expertise. But it’s still all based on trust, oversight and knowledge of the limitations of existing chat tools.
Essentially, “this copy is great, so of course you, a human, wrote it”. For the moment, in a well-run operation, anyone relying heavily on a chatbot for copy will get found out very quickly; there’s too much of a quality gap.
But it will be interesting to see if that trust erodes as the gap closes.
AI and the culture wars
The above tweet is both grim and funny, and representative of a lot of people’s experience of gen AI: a biased world reflected back at them when they use these tools. In a recent Rolling Stone piece, multiple women in tech detailed their frustrations at tackling the range of biases they encountered while working in AI.
But it’s 2023 and, of course, AI is going to feature in the culture wars. So the tweet, like many social posts highlighting similar issues, is getting reshared, with plenty of AI defenders unwilling to let their new favorite tool be besmirched like this. Meanwhile, the conservative talking point that AI is ‘left wing’ sees many of the same arguments switch sides.
At their core, many of the arguments are just a question of semantics. Technically, it’s the data, not the AI, that is biased. Which sort of misses the point.
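To make that concrete, here’s a toy sketch; the corpus and word lists below are invented purely for illustration. Count how often gendered pronouns co-occur with profession words in the training text, and you have the skew a model trained on that text will happily reproduce:

```python
from collections import Counter
from itertools import product

# Invented toy corpus: any model trained on text like this inherits its skew.
corpus = [
    "the nurse said she would help",
    "the engineer said he fixed the bug",
    "the engineer said he shipped the release",
    "the nurse said she was tired",
]

professions = {"nurse", "engineer"}
pronouns = {"he", "she"}

# Count profession/pronoun co-occurrences within each sentence.
counts = Counter()
for sentence in corpus:
    words = set(sentence.split())
    for prof, pron in product(professions & words, pronouns & words):
        counts[(prof, pron)] += 1

for (prof, pron), n in sorted(counts.items()):
    print(f"{prof!r} co-occurs with {pron!r}: {n}")
```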
Good stuff to know
A US federal judge ruled on Friday that AI-created art cannot be copyrighted.
Custom instructions are now available to all free users of ChatGPT.
Has your book been used to train the AI? A March Substack post that explored the use of Books3.