All I Want For XMAS is a Wrapper Around Your API

Or how to scrape the web for audio files when text to speech won't do

Dec 08, 2023

When an Arabic word pops up in my game, I want to hear it pronounced - not by some text to speech application, but actually pronounced properly.

The learning loop goes like this: the word pops up, there is a pause, I say out loud how I think the word is pronounced, the game plays the audio of its actual pronunciation, I say it again, and then I type it out and the game shows me the translation. I have to fetch this audio from somewhere, preferably through an automated process.

There is a site called Reverso which is great for this! They have all kinds of words in all kinds of languages pronounced live.

I loaded the page, hit F12, and looked at the requests in the network tab.

206 is the status for “Partial Content”, which in this case means streamed audio. Basically, the client has requested some range of data, and the server responds “yep, here is the chunk you asked for”. The request carries an important range header.

Here you can see the range value is bytes=0-, which basically means everything from byte 0 onwards. This makes sense because it is a very short audio recording and the whole thing plays: when you click the little speaker icon on Reverso, it reads the whole word, and there is no pause functionality.
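
To make the range idea concrete, here is a small Python sketch (Python only because it is easy to run outside the game). It works out which bytes a value like bytes=0- refers to for a resource of a given size, and what the matching Content-Range line in a 206 response would look like. The function names are made up for illustration, and only the "start-" and "start-end" forms are handled.

```python
def resolve_range(range_header: str, total_size: int) -> tuple:
    """Turn a Range header value into inclusive (start, end) byte offsets."""
    unit, _, spec = range_header.partition("=")
    if unit.strip() != "bytes":
        raise ValueError(f"unsupported range unit: {unit!r}")
    start_text, _, end_text = spec.partition("-")
    start = int(start_text)
    # An empty end means "through the last byte of the resource".
    end = int(end_text) if end_text else total_size - 1
    return start, min(end, total_size - 1)


def content_range(range_header: str, total_size: int) -> str:
    """Build the Content-Range value a server would send back with a 206."""
    start, end = resolve_range(range_header, total_size)
    return f"bytes {start}-{end}/{total_size}"


# For a 1000-byte recording, "bytes=0-" covers the whole file.
print(resolve_range("bytes=0-", 1000))
print(content_range("bytes=0-", 1000))
```

So a 206 does not have to mean "only part of the file": with an open-ended range starting at 0, the partial content is the entire recording.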

How can I isolate this request and inject it into my Godot game?

Isolation

Right click the network request, hover over copy, and select “copy all as HAR”.

The HAR file is a snapshot of the recorded network requests, including the exact one used to get that audio resource. To take a closer look at it, use Postman or some similar HTTP client. Postman is a bit of a bore and makes you log in to an account if you want to upload a request in any format other than cURL. But we overcome.

If you are signed into Postman you get this page, which is where you drop your HAR file. (You might have to paste your copied HAR into a file called something like “test.har” and then upload that.)

We get a BUNCH of stuff from this (depending on how long your session was on the Reverso site). But I filtered by “voice” or some such, which narrowed down the recorded requests. Eventually I picked out one that looked right.
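
If you would rather skip Postman, the same filtering can be sketched in a few lines of Python, since a HAR file is just JSON with the requests listed under log.entries. The sample data and the example.com URLs below are invented for illustration; only the log / entries / request / response layout comes from the HAR format itself.

```python
def find_requests(har: dict, needle: str) -> list:
    """Return (method, url, status) for every HAR entry whose URL contains needle."""
    matches = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        if needle in request["url"]:
            status = entry["response"]["status"]
            matches.append((request["method"], request["url"], status))
    return matches


# A tiny stand-in for a real HAR export (hypothetical URLs).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://www.example.com/styles.css"},
                "response": {"status": 200},
            },
            {
                "request": {"method": "GET", "url": "https://voice.example.com/GetVoiceStream?inputText=abc"},
                "response": {"status": 206},
            },
        ]
    }
}

for method, url, status in find_requests(sample_har, "voice"):
    print(method, status, url)
```

With a real export you would load the file with json.load first and pass the resulting dictionary in.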

The biggest thing I did here was realize that inputText was probably base64 encoded. But boy did I feel like an absolute genius.
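
A quick way to test that hunch outside the game is a Python round trip: decode a suspicious-looking value and see whether readable text comes out. This sketch assumes the text is UTF-8 before it is base64 encoded, which matches what Godot's Marshalls.utf8_to_base64 does; the function names and the sample word are just for illustration.

```python
import base64


def encode_input_text(word: str) -> str:
    """UTF-8 encode the word, then base64 encode the bytes."""
    return base64.b64encode(word.encode("utf-8")).decode("ascii")


def decode_input_text(value: str) -> str:
    """Reverse the process: base64 decode, then read the bytes as UTF-8."""
    return base64.b64decode(value).decode("utf-8")


word = "مرحبا"
encoded = encode_input_text(word)
print(encoded)                     # plain ASCII, safe to paste into a URL field
print(decode_input_text(encoded))  # back to the original Arabic word
```

If decoding a captured inputText value gives you the word you clicked on, the guess was right.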

And the headers ….

Originally when I tried this, I stripped out all the headers, but you need to leave the user-agent. I’m not sure why, but I know Reverso uses ASP.NET, so I’m guessing it’s some setting in there to reject requests that don’t have a user-agent. Presumably they only want to respond to requests from browsers.

So now we can change inputText to whatever base64-encoded Arabic word we want, and it will return to us an audio stream of it being pronounced!
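
Put together, the isolated request can be sketched in Python like this. The endpoint and header values are the ones from the full Godot script later in this post (the user-agent is the same placeholder), while build_request is a made-up helper name. The sketch only builds the request object and never sends it, so nothing here touches the network.

```python
import base64
import urllib.request

BASE_URL = "https://voice.reverso.net"
PARAMS = (
    "/RestPronunciation.svc/v1/output=json/GetVoiceStream/"
    "voiceName=Mehdi22k?voiceSpeed=100&inputText="
)
HEADERS = {
    "user-agent": "values values values",  # placeholder, as in the Godot script
    "range": "bytes=0-",                   # ask for everything from byte 0 onwards
    "accept-encoding": "identity",
}


def build_request(word: str) -> urllib.request.Request:
    """Build (but do not send) the pronunciation request for one word."""
    encoded = base64.b64encode(word.encode("utf-8")).decode("ascii")
    return urllib.request.Request(BASE_URL + PARAMS + encoded, headers=HEADERS)


req = build_request("مرحبا")
print(req.full_url)
print(req.header_items())
```

One thing I have not verified: base64 output can contain +, / and =, which have special meanings in URLs, so some words might need percent-encoding before being appended.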

Godot Implementation

I didn’t want to fire off a request for every word in a vocabulary up front, because that would probably get us shut out of the Reverso servers. So if a given word does not have an audio recording saved, it is fetched only when the word is presented to the user for the first time.
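
The idea is plain lazy caching. Stripped of anything Godot-specific, it can be sketched in Python as below. PronunciationCache and fake_fetch are hypothetical names, and the fake fetcher stands in for the real HTTP call so the sketch runs offline.

```python
class PronunciationCache:
    """Fetch audio for a word only the first time it is asked for."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a word and returning audio bytes
        self._audio = {}     # word -> audio bytes already fetched

    def get(self, word: str) -> bytes:
        if word not in self._audio:
            # First presentation of this word: go and fetch it once.
            self._audio[word] = self._fetch(word)
        return self._audio[word]


calls = []


def fake_fetch(word: str) -> bytes:
    """Stand-in for the real request; records each call so we can count them."""
    calls.append(word)
    return b"audio-for-" + word.encode("utf-8")


cache = PronunciationCache(fake_fetch)
cache.get("كتاب")
cache.get("كتاب")
print(len(calls))  # the fetcher ran once even though the word was shown twice
```

The server only ever sees words the player has actually practiced, and each of them at most once.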

The first time a word is presented in practice mode

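func fetch_arabic_word(word: VocabularyWord):
	if word.audio.is_empty():
		# No saved audio yet: base64 encode the word and request it.
		var encoded_word = Marshalls.utf8_to_base64(word.arabic_word)
		var final_url = url + params + encoded_word
		# Remember which word is in flight so the callback can store its audio.
		arabic_word = word.arabic_word
		request(final_url, headers)
	else:
		# Audio was saved on an earlier fetch, so play it directly.
		play_audio(word.audio)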

The word.audio.is_empty() call checks if there is already a PackedByteArray associated with the word. If not, then it encodes the Arabic word in base64 and slaps it onto the final_url. The line where arabic_word is set to word.arabic_word is me setting a value on this larger class because I’m a lazy programmer and I need it later.

Then we create some functions to fire when the request started by request(final_url, headers) completes.

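func callback(_result, response_code, _headers, body: PackedByteArray):
	# 206 Partial Content is what the server sends back for the range request.
	if response_code == 206:
		play_audio(body)
		# Save the bytes so the word does not need to be fetched next time.
		VocabularyManager.add_pronunciation(arabic_word, body)

func play_audio(sound: PackedByteArray):
	# Wrap the raw MP3 bytes in a stream the AudioStreamPlayer can play.
	var stream = AudioStreamMP3.new()
	stream.set_data(sound)
	audio_stream_player.set_stream(stream)
	audio_stream_player.play()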
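
The callback trusts that any 206 body is playable audio. A cheap sanity check before saving the bytes could look like the Python sketch below. It assumes the audio is MP3 (the Godot code uses AudioStreamMP3) and only looks at the first few bytes: either an ID3 tag or an MPEG frame sync pattern. looks_like_mp3 is a made-up name, and this is a heuristic, not a full validation.

```python
def looks_like_mp3(data: bytes) -> bool:
    """Heuristic check: does this byte string start like an MP3 file?"""
    # Many MP3 files begin with an ID3v2 metadata tag.
    if data[:3] == b"ID3":
        return True
    # Otherwise expect an MPEG audio frame: the first 11 bits are all set.
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0


print(looks_like_mp3(b"ID3\x04\x00\x00"))       # tagged file
print(looks_like_mp3(b"\xff\xfb\x90\x00"))      # bare MPEG frame header
print(looks_like_mp3(b"<html>error</html>"))    # an error page, not audio
```

A check like this would keep an error page from being stored as a word's pronunciation.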

The whole thing looks like this:

extends HTTPRequest

const url = "https://voice.reverso.net"
const params = "/RestPronunciation.svc/v1/output=json/GetVoiceStream/voiceName=Mehdi22k?voiceSpeed=100&inputText="
const headers = [
	"accept: */*",
	"user-agent: values values values",
	"range: bytes=0-",
	"origin: https://context.reverso.net",
	"accept-language: en-US,en;q=0.9",
	"accept-encoding: identity"
]
@onready var audio_stream_player = $"../AudioStreamPlayer"
var arabic_word: String 

func _ready():
	self.request_completed.connect(callback)
	
func fetch_arabic_word(word: VocabularyWord):
	if word.audio.is_empty():
		var encoded_word = Marshalls.utf8_to_base64(word.arabic_word) 
		var final_url = url + params + encoded_word
		arabic_word = word.arabic_word
		request(final_url, headers)
	else:
		play_audio(word.audio)

func callback(_result, response_code, _headers, body: PackedByteArray):
	if response_code == 206:
		play_audio(body)
		VocabularyManager.add_pronunciation(arabic_word, body)
		
func play_audio(sound: PackedByteArray):
	var stream = AudioStreamMP3.new()
	stream.set_data(sound)
	audio_stream_player.set_stream(stream)
	audio_stream_player.play()

Michaela’s Substack
