Procedural Audio Generation


This is part of a technical writeup. For the initial concept, see Jam Entry.

The game felt a little too plain with just text boxes and multiple-choice options. In a more classic board game, you would have players read action cards aloud. This would give games a bit of flavour.

So I wanted to make the game as close to fully voice acted as possible.

Obviously, hiring voice actors is completely impractical for a game jam. 

Being a programmer, I decided to stick with the profession and use code to generate audio.

To do this, I used an open-source application called "MARY TTS" for voice synthesis.

This works in three parts. First, the script engine of my game would resolve the text to be shown and hash it into a unique key. Second, the text would be fed into the synthesis software, producing a .wav file, stored under that key, which I could bundle with the game. Third, at run-time, all the game needed to do was re-compute the hash and play back the matching .wav file.
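The three parts above can be sketched roughly as follows. This is a minimal illustration and not the game's actual code: the writeup does not name the hash function, so CRC-32 is assumed here only because it yields an 8-hex-digit key of the same shape as the sample below. The names `audio_key`, `audio_path`, and `build_audio` are hypothetical, and `synthesize` is a stand-in for whatever call hands text to the synthesis software and returns the .wav bytes.

```python
import zlib
from pathlib import Path
from typing import Callable, Iterable, List


def audio_key(text: str) -> str:
    """Hash the displayed text into an 8-hex-digit key (CRC-32 is an assumption)."""
    return f"{zlib.crc32(text.encode('utf-8')):08x}"


def audio_path(text: str, audio_dir: Path) -> Path:
    """Map a piece of dialogue to the .wav file that holds its audio."""
    return audio_dir / f"{audio_key(text)}.wav"


def build_audio(lines: Iterable[str], audio_dir: Path,
                synthesize: Callable[[str], bytes]) -> List[Path]:
    """Build step: synthesise every line that does not have a .wav file yet."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for text in lines:
        path = audio_path(text, audio_dir)
        if not path.exists():
            # `synthesize` is a placeholder for the real text-to-speech call.
            path.write_bytes(synthesize(text))
            written.append(path)
    return written
```

At run-time the game would only need `audio_path(text, audio_dir)` to find the file to play, since the key is derived from the same text that is shown on screen.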

Below is a sample of the text that is output and the corresponding hash:

  • audio snippet text: 
  • a12f604a
  • After a long day player Dungeons & Dragons You find yourself bleary-eyed You consider driving home...

Overall, I am quite happy with how this turned out. The voices were much higher quality than I expected, and I alternated between two different voices to get some variety that I would never have achieved if I had voice acted it all myself. Generating the audio was also much faster than recording it. Recording takes at least as long as the dialogue itself, so skipping that step programmatically was a huge time saver.

However, there was one big problem with this approach. Maybe you can notice it in the text above? Typos!

The text here should say:

  • After a long day playing Dungeons & Dragons. You find yourself bleary-eyed. You consider driving home...

Typos get read out verbatim by the voice synthesis, and since it can't see newlines, a missing full stop can cause separate lines to run together.

The problem is that since the audio is keyed by a hash of the text, even a single-character change produces a new hash. This means every small change in the script results in a large number of new files being output. The only way to identify which files are new or old is to re-compute the hashes for each version of the script.
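To illustrate why a one-character fix orphans a file, here is a rough sketch of comparing two versions of the script by their hashes. As before, CRC-32 and the names `audio_key` and `diff_audio` are assumptions made for illustration, not the game's actual tooling.

```python
import zlib


def audio_key(text: str) -> str:
    """8-hex-digit key for a piece of dialogue (CRC-32 is an assumption)."""
    return f"{zlib.crc32(text.encode('utf-8')):08x}"


def diff_audio(old_lines, new_lines):
    """Compare two script versions by hash and report what changed."""
    old_keys = {audio_key(t) for t in old_lines}
    new_keys = {audio_key(t) for t in new_lines}
    return {
        "to_generate": sorted(new_keys - old_keys),  # no .wav exists yet
        "orphaned": sorted(old_keys - new_keys),     # .wav no longer referenced
        "unchanged": sorted(new_keys & old_keys),    # .wav can be reused
    }


old_script = ["After a long day player Dungeons & Dragons",
              "You consider driving home..."]
new_script = ["After a long day playing Dungeons & Dragons.",
              "You consider driving home..."]
report = diff_audio(old_script, new_script)
```

Here fixing the typo in the first line leaves one orphaned file, one file to generate, and one unchanged file, even though only a few characters differ.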

As a result, a large number of typos were found but kept in the game. Some of them are fixed in the audio output (with the displayed text retained to preserve the hash). This is an unfortunate consequence, but not game-breaking, so it was an acceptable trade-off to save considerable time.
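One way to picture the "fixed in the audio, retained in the text" workaround is a small override table. This is a sketch under assumptions rather than the method actually used: the key is still computed from the displayed (typo'd) text, so the file name does not change, while the text handed to the synthesiser is swapped for a corrected version. `SPOKEN_OVERRIDES` and `spoken_text` are hypothetical names, and CRC-32 is again assumed as the hash.

```python
import zlib


def audio_key(text: str) -> str:
    """8-hex-digit key for a piece of dialogue (CRC-32 is an assumption)."""
    return f"{zlib.crc32(text.encode('utf-8')):08x}"


# Hypothetical table: key of the displayed text -> corrected text to speak.
SPOKEN_OVERRIDES = {
    audio_key("After a long day player Dungeons & Dragons"):
        "After a long day playing Dungeons & Dragons.",
}


def spoken_text(displayed: str) -> str:
    """Return the text to synthesise; the key still comes from `displayed`."""
    return SPOKEN_OVERRIDES.get(audio_key(displayed), displayed)
```

Because the override is looked up by the hash of the unchanged on-screen text, regenerating a single .wav file fixes what the player hears without touching any other file.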