September 3, 2022 – curtain up! At the Semperoper in Dresden, a hitherto unknown singing talent faced an audience of opera professionals. After Andrew Lloyd Webber’s Cats in August and The Magic Flute in mid-September wowed opera audiences, the spotlight turned to artificial intelligence (AI) later in the month.
But the real mastermind behind the innovative opera project is director and media artist Sven Sören Beyer. He has worked together with Berlin artist collective phase7 performing.arts since 1999, putting together performance productions and installations. He is always sounding out the area of tension between man and machine. In his latest project, he reflects on the influence that AI has had on our lives.
Johann Casimir Eule, chief dramaturge at the Semperoper, saw the potential for innovation when Beyer’s proposal landed on his desk around two years ago. “We’ve worked with research institutes for quite some time and have now been able to combine this and bring the ‘old dog’ of the Semperoper into new technical dimensions. Perhaps this type of musical theater will be pioneering in 15 or 20 years,” says Eule. This is why those involved agreed that AI should play a significant role in the creation of the opera.
The story might remind some people of Tron. In the pioneering Disney film from 1982, programmer Jeff Bridges finds himself a prisoner within a computer network and, with the help of his program Tron (a kind of alter ego), tries to escape – and to put a stop to the Master Control Program, the AI alter ego of his adversary David Warner. What was purely science fiction at the time now, 40 years later, has a real, somewhat bitter taste of dystopia in Chasing Waterfalls. Norwegian soprano Eir Inderhaug from the cast of the Bavarian State Opera is confronted with not one, but six ‘digital twins’ she needs to grapple with.
Back then it was a laser, now it’s a simple login. When the real, physical selves log into the computer, they are faced with digital copies of themselves: the numerous traces that they leave behind in the digital world of the Internet. They encounter their digital twins, Ego Fluentes, which operate in the virtual world as independent personalities, and eventually even team up against the physical selves and attack reality.
Will the physical selves be able to assert themselves as real people against the dissolving boundaries? The boundaries between the virtual and physical world become blurred on stage, with the stage design highlighting this effectively.
But what about the real world? Beyer wants to use his work to contribute to the discussion surrounding how digital our personalities are today, prompting age-old questions such as: What is truth (in an increasingly digitalized world)? What makes us human?
Over the course of just 70 minutes, Chasing Waterfalls immersed the audience in a world in which opera and digitalization merged. AI had a key role to play in this evening too, before moving on to its next engagement in Hong Kong, the home of composer Angus Lee. Naturally, it didn’t get stage fright and won’t in Hong Kong either – but perhaps it has the mannerisms of an opera diva? How can you make AI actually start singing?
“With all the achievements AI can already boast of, a ready-made AI opera singer does not yet exist,” smiles Nico Westerbeck, who worked as the technical lead with the AI, training it from a beginner in the girls’ choir to an opera soloist. Westerbeck is a passionate computer and data scientist. He has worked at T-Systems Multimedia Solutions (MMS) in Dresden since 2018. His main focus areas are deep learning for language and text, reinforcement learning, and security. He and the MMS team turned the artists’ innovative ideas into reality and brought the AI to life.
Strictly speaking, it isn’t correct to speak of just ‘one’ AI – several were involved. Librettist Christiane Neudecker worked with GPT-2 and GPT-3 to develop texts, while a second AI learned how to read music and a third how to sing. A team from T-Systems MMS was heavily involved in developing the singing AI.
“We didn’t start completely from scratch with our opera singers. We used research results from the area of text-to-speech, primarily the work of Chen et al. (‘HifiSinger’, 2020) and Liu et al. (‘DiffSinger’, 2021), which converted a text-to-speech system into a singing voice synthesis system,” says Westerbeck. As the project progressed, too, the search for up-to-date findings remained a constant companion for the MMS team – after all, singing AIs are new territory. Westerbeck dug through dozens of publications in order to find the key to help make a piece of AI sing.
But there was quite a way to go from ‘Hello, how can I help you?’ to ‘Hell's vengeance boils in my heart’ – and a great deal of code had to be written too. How do you describe a singing voice in precise detail? Language is complex, and singing even more so, especially when it comes to mapping them digitally. A typical music file sampled at 44.1 kHz contains roughly 44,000 individual sound-pressure values for every second of audio. How many words can a person speak or sing in this time?
American actor Eddie Murphy sounds like he’s speaking 50 words in Beverly Hills Cop, but in an opera, it’s perhaps five. How do you distribute these 44,000 pulses across five words (and the different notes)? Where does each phoneme – the basic units of sound in language – begin? It’s a real puzzle, and a complex one at that.
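To make the puzzle concrete, here is a minimal sketch of how one second of audio might be divided among phonemes. It assumes the CD-standard rate of 44,100 samples per second; the function name and the phoneme timings are invented for illustration and are not taken from the project.

```python
SAMPLE_RATE = 44_100  # samples per second (CD standard, assumed here)

def phoneme_boundaries(durations_ms, sample_rate=SAMPLE_RATE):
    """Return (phoneme, start_sample, end_sample) for consecutive phonemes.

    durations_ms is a list of (phoneme, duration in whole milliseconds).
    """
    boundaries = []
    elapsed_ms = 0
    for phoneme, duration_ms in durations_ms:
        start = elapsed_ms * sample_rate // 1000
        elapsed_ms += duration_ms
        end = elapsed_ms * sample_rate // 1000
        boundaries.append((phoneme, start, end))
    return boundaries

# Hypothetical timing for the sung word "Hell", held for one second.
example = [("h", 50), ("e", 800), ("l", 150)]
for phoneme, start, end in phoneme_boundaries(example):
    print(phoneme, start, end)
```

In a sung phrase the vowel typically carries most of the note, which is why the made-up example gives "e" the lion's share of the samples. A real system has to learn these boundaries from recordings rather than being handed them.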
The MMS team decided to take a pragmatic approach. The AI was to learn from a model. Kling klang klong invited Eir Inderhaug, who would play the role of the real self in the opera performance, to the studio in Berlin. There she initially sang 50 children’s songs, which were digitalized and sent to T-Systems MMS. Why children’s songs? “One of the publications we read recommended children’s songs,” explains the AI specialist. “This was an effective approach too. However, at a later stage of the project, it became clear that we wouldn’t achieve our goal this way. Rudolph the Red-Nosed Reindeer is no operatic aria – even when it’s sung by an opera singer.”
A second visit to the studio was therefore necessary. Inderhaug had to go one better, singing 20 operatic arias, which ultimately provided 10 more minutes of training material.
In the end, 70 songs were used by the AI team as a data source, intended to show the AI how singing functions. “We then had a sufficiently wide spectrum of data to avoid overfitting.”
The AI experts at T-Systems MMS then developed an architecture for a neural network that takes notes and text as input and generates audio output from them. The team decided to spread the complexity of the task across several stages and built a pipeline of multiple neural sub-networks. “In a few years, it’ll perhaps no longer be required, but the complexity we were faced with made this strategy necessary,” admits Westerbeck. These networks were initially as musical as a housefly on the hunt for food.
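The idea of a pipeline can be sketched in a few lines. The article does not name the individual stages, so the three below are assumptions, and each is a simple stand-in function rather than a neural network: one expands the score into frames, one turns pitches into frequencies, and one renders sound (here just a sine wave).

```python
import math
from functools import reduce

SAMPLE_RATE = 8_000  # samples per second; kept low so the toy output stays small
FRAME_MS = 10        # assumed frame length: one feature frame per 10 ms

def score_to_frames(score):
    """Stage 1 (stand-in): expand (syllable, midi_pitch, duration_ms) into per-frame pitch."""
    frames = []
    for _syllable, midi_pitch, duration_ms in score:
        frames.extend([midi_pitch] * (duration_ms // FRAME_MS))
    return frames

def frames_to_frequencies(frames):
    """Stage 2 (stand-in): convert MIDI pitch numbers to frequencies in Hz."""
    return [440.0 * 2 ** ((pitch - 69) / 12) for pitch in frames]

def frequencies_to_waveform(frequencies):
    """Stage 3 (stand-in for a vocoder): render every frame as a sine wave."""
    samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000
    samples, phase = [], 0.0
    for hz in frequencies:
        for _ in range(samples_per_frame):
            phase += 2 * math.pi * hz / SAMPLE_RATE
            samples.append(math.sin(phase))
    return samples

def run_pipeline(stages, data):
    """Feed the output of each stage into the next one."""
    return reduce(lambda value, stage: stage(value), stages, data)

STAGES = [score_to_frames, frames_to_frequencies, frequencies_to_waveform]
score = [("la", 69, 500), ("la", 72, 250)]  # hypothetical two-note score
waveform = run_pipeline(STAGES, score)
print(len(waveform), "samples")
```

Splitting the work this way means each stage can be trained and checked on its own, which is one plausible reading of why the team chose it.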
At least that’s what the first result sounded like. This isn’t surprising, as the first parameters for the neural network were initially created by a random generator.
The AI was trained in so-called epochs. Within an epoch, the AI was shown the complete dataset – which had already been split up into 10,000 snippets. In this way, three million ‘training sessions’ were provided over a total of 300 epochs.
After each run through the network (a ‘forward pass’) within a training epoch, the AI’s output was scrutinized. The AI ‘aria’ was compared with the professionally sung version by Eir Inderhaug. This ruthless review, known as the ‘loss’, quantified the performance of the model. The quantified results were then fed back automatically into the neural network, which adjusted the originally random parameters.
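The cycle described above (random start, forward pass, comparison with the reference, adjustment) can be shown with a deliberately tiny example. This is a sketch under loud assumptions: the ‘model’ is a straight line with two parameters rather than a singing network, the data is invented, and the update rule is plain stochastic gradient descent, which the article does not confirm the team used. The figures in the article work out as 10,000 snippets × 300 epochs = 3,000,000 training steps; the toy uses 10 snippets instead.

```python
import random

def train(snippets, epochs, learning_rate=0.05, seed=0):
    """Fit y = w * x + b by stochastic gradient descent; return (w, b, epoch_losses)."""
    rng = random.Random(seed)
    # Parameters start out random, so the first outputs are essentially noise.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in snippets:      # one epoch = every snippet shown once
            prediction = w * x + b      # forward pass
            error = prediction - target # compare with the reference
            total += error ** 2         # squared-error loss for this snippet
            # Feed the error back: move each parameter against its gradient.
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
        epoch_losses.append(total / len(snippets))
    return w, b, epoch_losses

# Toy "recording": targets follow y = 2x + 1, standing in for the sung reference.
data = [(x / 10, 2 * (x / 10) + 1) for x in range(10)]
w, b, losses = train(data, epochs=300)
print(f"first epoch loss {losses[0]:.4f}, last epoch loss {losses[-1]:.8f}")
print(f"training steps: {len(data) * 300}")
```

The loss of the first epoch is large and the last is close to zero, which is the toy equivalent of the housefly slowly turning into a singer.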
“At the start of the training, a neural network makes huge errors. That’s not unusual,” explains Westerbeck. “The purpose of the training is to reduce these errors and gradually continue to improve.” The neural network increasingly started to recognize sounds that were too loud or too high, and learned from this so that it sang better on the next attempt. The second sample (after 10 epochs) was the first to sound like singing. “A bit like a radio that isn’t tuned to an exact station, or secret messages from outer space which have been extracted from cosmic background noise,” smiles Westerbeck.
But of course this still wasn’t enough for the operatic artists. Dramaturge Eule sums up the development of the AI, and his words convey the slight dismay felt by the classical opera artists at the beginning of the project: “When we heard the first singing samples, we were unsure as to whether we really wanted to venture into this experiment.”
Indeed, if you think that the AI would now only ever deliver the same version of the song, you’d be wrong. The neural network is dynamic and is always slightly changing its playback. Neither listeners nor vocal coaches are able to say what they will hear exactly, just that an ‘audible’ result will be created that suitably reproduces the played notes and the text. The neural network will provide ‘just’ an optimized result, which is always slightly different from the original – and rightly so. Whether a singer rolls an R or hisses an S is entirely up to them. The same is true for the AI. It is perhaps a bit more human in this regard than one might think.
But the makers of the opera wanted to go one better. Not only was the algorithm Ego Fluens to sing songs for which it received the (previously unknown) notes and text from the score; it was also given a 4-minute passage in which it was supposed to improvise. An extemporization. For this passage, the AI received no texts or notes from humans. Thanks to the work of the team at kling klang klong, it was possible to combine the AI language model GPT-3 with a note composition model, providing the synthetic vocal model (the singing AI) with different texts and notes for each performance.
The singing AI received these during the performance – those responsible checked first that the texts did not contain racist or sexist content. You never know with AI...
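How such a chain with a safety check might be wired together can be sketched as follows. Everything here is hypothetical: the text generator returns canned lines instead of calling GPT-3, the note model simply cycles through a scale, and the check is a plain blocklist, whereas the article does not say how the real check was carried out.

```python
def generate_text(seed):
    """Stand-in for the language model: returns a canned line per seed."""
    lines = ["I am so much more than a machine", "FORBIDDEN example line"]
    return lines[seed % len(lines)]

def is_acceptable(text, blocked_terms):
    """Reject any text that contains a blocked term (case-insensitive)."""
    lowered = text.lower()
    return not any(term in lowered for term in blocked_terms)

def compose_notes(text):
    """Stand-in for the note model: one MIDI pitch per word, cycling a scale."""
    scale = [60, 62, 64, 65, 67]
    return [scale[i % len(scale)] for i, _ in enumerate(text.split())]

def improvise(seed, blocked_terms, max_attempts=5):
    """Generate text, keep the first line that passes the check, add notes."""
    for attempt in range(max_attempts):
        text = generate_text(seed + attempt)
        if is_acceptable(text, blocked_terms):
            return text, compose_notes(text)
    raise RuntimeError("no acceptable text generated")

text, notes = improvise(seed=1, blocked_terms={"forbidden"})
print(text)
print(notes)
```

The point of the gate is that only text that has passed the check ever reaches the note model and the singing voice; a rejected line is simply regenerated.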
The audience of experts shouldn’t have noticed any differences in the passages in which text and notes were fixed beforehand and the AI followed the score and the libretto. “Here, the result within certain parameters was quite clear – and we were quite relaxed,” remembers Westerbeck. And while the AI didn’t deliver world-class soprano-level singing, it sang easily understandable texts with the right notes.
However, the project team was much less relaxed when it came to the live experiment of the ‘improvisation phase’. Would the AI do well here too? Every human actor needs a certain amount of talent for improvisation – the Marx Brothers’ scripts, for example, were said to sometimes contain the stage direction ‘Harpo does something funny’ – and he always delivered. “Even though we’d prepared the AI thoroughly, there was no guarantee. We crossed our fingers that it wouldn’t do anything funny either,” admits Westerbeck.
Then it was time for scene 5. The human actors lay down, the red light came on: the stage was free for the AI to sing its solo aria. And it did a great job here too. GPT-3 developed a suitable text, the note model a suitable melody for the harmonies, and the singing AI translated everything into song. The AI developed a rather classic model, which differed quite significantly from the rest of the very experimental opera. The AI, therefore, did very well in the improvisation section as a composer too. It processed the freshly generated text and the unknown notes in real time into a live aria.
The experiment was a success; the opera AI had proven itself. Even the libretto impressed certain critics. Would you like to hear a sample? And if the AI assures you that ‘I am so much more than a machine... My heart is just a cold hard drive,’ it really does seem like it is worried about its own existence.