Voice Assistant File Transfer Protocol: Encode files verbally and transfer them through the Alexa voice assistant

Posted at — Jan 2, 2021

By Richard Audette, richard@hotelexistence.ca

Introduction

I have created a Voice Assistant File Transfer Protocol for Alexa-powered voice assistants, like the Amazon Echo. The protocol uses verbal encoding to transfer the file through Alexa itself - it does not merely control an external file transfer application. I have developed a proof of concept, which includes a client application, a server application, and an Alexa Skill. The client encodes a binary file as English words, launches an Alexa Skill and verbally sends the data through the PC’s speakers to an Amazon Echo device. The server decodes the words, saves the binary file, and makes it available on the Internet.

File verbally encoded, transferred, and served

Background

In November, I received an email from Amazon informing me I was automatically opted in to “Amazon Sidewalk”, unless I chose to opt out. Amazon Sidewalk is a mesh network that allows devices participating in the Sidewalk program to connect to the Internet through Amazon devices, like the Amazon Echo. The interesting thing is, your Internet connection is being made available to traffic that’s not yours - the mesh is being made available to all devices participating in Amazon Sidewalk. Amazon stresses that the bandwidth used by Sidewalk is minimal, and that they have taken measures to ensure it is secure.

This technology interests me - I love the idea of mesh networks. I have previously tried to build a mesh network in my neighbourhood, but was unable to attract enough interest from my neighbours.

As I read concerns on various technology forums about opening your Internet connection to traffic that isn’t your own, it struck me that an Amazon Echo does this already - anyone within listening range of an Echo device is effectively using your Internet connection, albeit in a very restricted capacity over a limited range. I could design a skill with the express purpose of getting data to and from a third party, through an Amazon Echo. What would this look like? How slow would it be?

Design

I decided to target the Alexa voice assistant, as it is the one I’m most familiar with, and the Amazon Sidewalk program sparked the idea. My first thought was to encode data like a telephone line modem, as home computers of the 1980s and 1990s did when connecting to other computers. It looks like this could be possible with the Alexa.RTCSessionController interface, but I wanted something simpler.

I started thinking about efficient ways of encoding binary data with English words, and stumbled on a thread on Stack Overflow.

Q: Is there any good way to “encode” binary data as plausible madeup words and back again? To give you a very simple and bad example. The data is split in 4 bits. The 16 possible numbers correspond to the first 16 consonants. You add a random vowel to make it pronounceable. So “08F734F7” can become “ba lo ta ku fo go ta ka”.
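The scheme described in the question can be sketched in a few lines of JavaScript. This is only an illustration of the quoted idea, not part of the proof of concept: the function names are made up for the example, and the "first 16 consonants" are assumed to be b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, which is consistent with the sample output in the question.

```javascript
// The 16 consonants used as the nibble alphabet (index = nibble value).
const CONSONANTS = 'bcdfghjklmnpqrst';
const VOWELS = 'aeiou';

// Default vowel picker: any vowel will do, it carries no data.
const randomVowel = () => VOWELS[Math.floor(Math.random() * VOWELS.length)];

// Encode a hex string: one consonant per 4-bit digit, plus a filler vowel.
// pickVowel is injectable so the output can be made deterministic.
function encodeHex(hex, pickVowel = randomVowel) {
  return [...hex].map((digit) => {
    const value = parseInt(digit, 16);
    if (Number.isNaN(value)) throw new Error(`Not a hex digit: ${digit}`);
    return CONSONANTS[value] + pickVowel();
  }).join(' ');
}

// Decode by reading only the leading consonant of each syllable.
function decodeSyllables(text) {
  return text.trim().split(/\s+/).map((syllable) => {
    const value = CONSONANTS.indexOf(syllable[0].toLowerCase());
    if (value < 0) throw new Error(`Unknown syllable: ${syllable}`);
    return value.toString(16).toUpperCase();
  }).join('');
}

console.log(decodeSyllables('ba lo ta ku fo go ta ka')); // 08F734F7
```

Each syllable carries only 4 bits, which is why the question itself calls this a very simple and bad example.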

The top-voted answer suggests it might be possible to pack 3-6 bits per letter, using unique sounding syllables. I started to think about implementation, and reviewed how my verbal encoding would fit in with the Alexa input, or “slot”, types, permitted by the API. Surprisingly, dictation does not appear to be a capability. I started thinking about what slot type would be best:

Does the AMAZON.CITY slot type recognize at least 256 cities, so that I could pack one byte (0-255) per city? Eg: Toronto = 0, Chicago = 1
Assuming the average city name is 3 syllables, could I do better than three syllables per byte?

I ultimately decided to ignore the optimization problem, and just work towards a working proof of concept. I defined the protocol as follows:

The client launches the skill: “Alexa, launch upload file skill”
The client then sends the utterance “Add byte” and a series of 1 to 4 digit numbers.
1. Each digit is sent separately, and the numbers are separated by pauses. Eg: 1 2 3 4 is 1234, whereas 1 2 {pause} and 3 4 {pause} are 12 and 34.
2. The client first sends the filename, encoding each character (support is limited to ASCII) as its decimal value + 1000. So values between 1000 and 1255 are interpreted by the server as the filename.
3. Values between 0 and 255 are interpreted by the server as a byte in the file.
4. 2000 is interpreted by the server as an end of file and end of session.
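The numbering rules above can be sketched on the client side as follows. This is a minimal illustration rather than the code in vaftp-client.js: the function names are hypothetical, and it assumes that every number is preceded by its own “add byte” utterance.

```javascript
const FILENAME_OFFSET = 1000; // filename characters are sent as ASCII code + 1000
const END_OF_FILE = 2000;     // signals end of file and end of session

// Build the full sequence of numbers for one file transfer.
function encodeTransfer(filename, data) {
  const numbers = [];
  for (const ch of filename) {
    const code = ch.charCodeAt(0);
    if (code > 127) throw new Error(`Non-ASCII character in filename: ${ch}`);
    numbers.push(FILENAME_OFFSET + code);
  }
  for (const byte of data) numbers.push(byte); // file bytes are sent as-is (0-255)
  numbers.push(END_OF_FILE);
  return numbers;
}

// What is spoken for one number: the utterance, then each digit separately.
// Assumption: the utterance is repeated before every number.
function toSpokenWords(number) {
  return ['add byte', ...String(number).split('')];
}

const numbers = encodeTransfer('x.txt', Buffer.from('x'));
console.log(numbers);             // [ 1120, 1046, 1116, 1120, 1116, 120, 2000 ]
console.log(toSpokenWords(1120)); // [ 'add byte', '1', '1', '2', '0' ]
```

For a file called x.txt containing the single letter x, five filename values are followed by one data byte and the end marker.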

The protocol in its current state is very fragile. In my testing, Alexa frequently misinterpreted the numbers spoken by the client application. Initially, I used samples of the numbers recorded in my own voice. Even when I normalized volume levels and optimized the timing between samples, I still encountered interpretation errors from Alexa. I tried the Festival Text to Speech engine, which is capable of speaking whole numbers (eg: one-thousand three hundred and two) rather than digits (one-three-zero-two), but results were also poor. I settled on samples I obtained from the Google search results for “How to pronounce {number}”, and obtained slightly better results. The protocol could be enhanced with a checksum. Alexa could respond with “resend” when the checksum was invalid, and the client could re-send the previous byte before moving on to the next byte.
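One possible shape for the checksum idea is sketched below. Everything in it is hypothetical and not part of the protocol as implemented: it assumes a reserved range of 3000 to 3255 for check values, uses the byte's complement as the check, and assumes the client can hear Alexa's reply, which the current client cannot do.

```javascript
// Hypothetical extension: none of this exists in the original protocol.
const CHECK_BASE = 3000; // assumed reserved range 3000-3255 for check values

// Check value for a byte: its complement, shifted into the reserved range.
function checkValueFor(byte) {
  return CHECK_BASE + (255 - byte);
}

// Server side: decide what Alexa should say after hearing a byte and its check value.
function verify(heardByte, heardCheck) {
  const validByte = Number.isInteger(heardByte) && heardByte >= 0 && heardByte <= 255;
  return validByte && heardCheck === checkValueFor(heardByte) ? 'ok' : 'resend';
}

// Client side: repeat the same byte until the server stops asking for a resend.
// `speak` stands in for playing the audio and listening for Alexa's reply.
function sendByte(byte, speak, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (speak(byte, checkValueFor(byte)) === 'ok') return attempt;
  }
  throw new Error(`Byte ${byte} not acknowledged after ${maxAttempts} attempts`);
}

// Simulated channel that mishears the first attempt (120 heard as 128).
let calls = 0;
const flakyAlexa = (byte, check) => verify(calls++ === 0 ? 128 : byte, check);
console.log(sendByte(120, flakyAlexa)); // 2
```

Sending a check value with every byte would roughly halve an already slow transfer rate, so a real design would probably checksum larger blocks.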

Code

The client and server are implemented in NodeJS. The code can be downloaded from: https://github.com/raudette/vaftp

Alexa Skill

The code in this proof of concept requires the creation of a custom Alexa Skill with the following properties:

invocation: upload file
utterance: addbyte {number}
slot type for number: AMAZON.NUMBER
end point: The HTTPS endpoint running the vaftp-server.js code
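Expressed as an interaction model for the Alexa developer console's JSON editor, these properties might look roughly like the object below. It is a sketch under assumptions: “AddByteIntent” is a hypothetical intent name chosen for illustration, and the endpoint is not part of the interaction model, so it still has to be configured separately in the skill's settings.

```javascript
// Sketch of an interaction model matching the properties listed above.
// "AddByteIntent" is a hypothetical name; the repository may use another.
const interactionModel = {
  interactionModel: {
    languageModel: {
      invocationName: 'upload file',
      intents: [
        {
          name: 'AddByteIntent',
          slots: [{ name: 'number', type: 'AMAZON.NUMBER' }],
          samples: ['addbyte {number}'],
        },
        // Built-in intents that custom skills are generally expected to include.
        { name: 'AMAZON.CancelIntent', samples: [] },
        { name: 'AMAZON.HelpIntent', samples: [] },
        { name: 'AMAZON.StopIntent', samples: [] },
      ],
    },
  },
};

// Print the JSON that would be pasted into the editor.
console.log(JSON.stringify(interactionModel, null, 2));
```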

For complete details on creating a skill, see Amazon’s documentation.

Server

The server application is vaftp-server.js. The server is configured to run on port 8080. I suggest running it on a server with port 443 exposed to the Internet, running Apache 2 with a valid certificate, configured to proxy port 443 to port 8080. On an appropriately configured server, run ‘npm install’ to install the dependencies and launch the server application as follows:

node vaftp-server.js

The server makes all uploaded files available in the ‘/files/’ subfolder: https://{your server’s domain}/files/
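To give a feel for the decoding side, here is a minimal sketch of how a request handler might apply the protocol's value ranges to the number slot of an incoming Alexa request. It is not the code in vaftp-server.js: the function names, the intent name and the spoken replies are made up for illustration, and it leaves out the HTTP layer, request verification and writing the file to disk.

```javascript
// Hypothetical request handler; not the author's vaftp-server.js.
function createSession() {
  return { filename: '', bytes: [], done: false };
}

// Build a plain-text Alexa response body.
function reply(text, shouldEndSession) {
  return {
    version: '1.0',
    response: { outputSpeech: { type: 'PlainText', text }, shouldEndSession },
  };
}

// Apply one Alexa request body to the session state and return the JSON reply.
function handleRequest(body, session) {
  const { request } = body;
  if (request.type === 'LaunchRequest') return reply('Ready', false);
  if (request.type !== 'IntentRequest') return reply('Goodbye', true);

  // AMAZON.NUMBER slot values arrive as strings of digits.
  const slot = request.intent.slots && request.intent.slots.number;
  const value = Number.parseInt(slot && slot.value, 10);

  if (value >= 0 && value <= 255) {
    session.bytes.push(value); // file byte
  } else if (value >= 1000 && value <= 1255) {
    session.filename += String.fromCharCode(value - 1000); // filename character
  } else if (value === 2000) {
    session.done = true; // end of file and end of session
    return reply('File received', true);
  } else {
    return reply('Unrecognized value', false);
  }
  return reply('OK', false);
}

// Feed the handler the sequence for a file x.txt containing the letter x.
const intent = (n) => ({
  request: {
    type: 'IntentRequest',
    intent: { name: 'AddByteIntent', slots: { number: { name: 'number', value: String(n) } } },
  },
});
const session = createSession();
[1120, 1046, 1116, 1120, 1116, 120, 2000].forEach((n) => handleRequest(intent(n), session));
console.log(session.filename, Buffer.from(session.bytes).toString()); // x.txt x
```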

Client

The client application is vaftp-client.js. It uses the play-sound Node library, which requires a command line audio player. I suggest installing mpg123. Run ‘npm install’ to install the remaining dependencies.

To send a file to the server, place the client computer near an Amazon Echo, ensure the volume is turned up, and run as follows, where filename is the file to be sent:

node vaftp-client.js <filename>

Demo

This is a demo of the system transferring a text file called ‘x.txt’ containing the single letter ‘x’.

Security Concerns

The privacy concerns of using a voice assistant have been widely discussed in the press - I think the New York Times’ Wirecutter post on Alexa provides a balanced assessment, so I won’t expand further on those points here. I have spent some time thinking about whether using a voice assistant to exchange data brings any new concerns to the discussion. Few would host an unsecured wifi Internet access point in their homes or on their phone, yet as demonstrated here, our voice assistants provide this capability, in an awkward, limited fashion.

As the logic for third party Alexa Skills can reside on third party servers, Amazon cannot audit the code, as they can with mobile apps on the app store. In Turning an Echo Into a Spy Device Only Took Some Clever Coding, Wired describes how Checkmarx demonstrated an Alexa Skill that could transcribe conversations overheard by a device. The article states that Amazon has some mitigations in place:

As part of expanded defenses, the researchers say that Amazon is now controlling empty prompts more carefully, screening for this type of eavesdropping functionality when it evaluates skills for its store, and cracking down on unusually long sessions. A company spokesperson told WIRED in a statement that, “We have put mitigations in place for detecting this type of skill behavior and reject or suppress those skills when we do.”

The conversation patterns used by this “Upload File” skill would be easy to detect, and the limit on long sessions would certainly limit how much data could be transferred. An implementation with the Alexa.RTCSessionController would be harder to detect.

I have concluded that, although this technique is interesting, it is too limited to be a cause for additional concern:

Any person or device within listening range of a voice assistant likely has better ways to transmit data.
The data transfer speed is so slow that it would limit how much data could escape undetected.
Although the voice assistant here acts as an unprotected limited access point to the Internet, it does not provide access to other devices that might be on the same network as the voice assistant, as an unprotected wifi access point could.

Proof of Concept Limitations

This proof of concept represents a minimum viable implementation. The most significant limitations are:

Data transfer speed. In testing, I achieved 1 bit per second. For comparison, a dial-up Internet connection can reach 57,600 bits per second.
The server cannot handle multiple sessions - only 1 user at a time is supported.
There is no error correction, checksums, or error handling. If Alexa misinterprets the client, the server can end up in an indeterminate state or create a file with errors. I saw very high error rates in testing.
The client doesn’t listen for Alexa to respond. Currently, actions are dependent on timing. If Alexa is slower to respond than in my test setup, the proof of concept will not work.
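To put the data transfer speed in perspective, the short calculation below compares the two rates quoted above for a single file. The 20 KiB file size is an assumption picked purely for illustration.

```javascript
// Seconds needed to move `bytes` bytes at `bitsPerSecond`.
function transferSeconds(bytes, bitsPerSecond) {
  return (bytes * 8) / bitsPerSecond;
}

const fileBytes = 20 * 1024; // assumed size: a 20 KiB document
const viaVoice = transferSeconds(fileBytes, 1);      // rate measured in testing
const viaDialUp = transferSeconds(fileBytes, 57600); // dial-up figure quoted above

console.log(`${(viaVoice / 3600).toFixed(1)} hours by voice`); // 45.5 hours by voice
console.log(`${viaDialUp.toFixed(1)} seconds by dial-up`);     // 2.8 seconds by dial-up
```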

Conclusion

It is possible to implement a protocol to send data through a voice assistant, without being impeded by any security measures. As voice assistants are in many homes and enabled on many phones, this could be a way to transfer data when other means aren’t possible. It is likely not a cause for security concern, as it is limited by listening range, which typically implies physical access and better alternatives for data transfer. The protocol as presented here is too slow for any conceivable use. An implementation built with the Alexa.RTCSessionController interface could potentially be as fast as a dial-up Internet connection, which would be fast enough to be practical for sending data, such as a Word document.

Please email richard@hotelexistence.ca with any feedback you have regarding this article.
