Several years ago I heard Alan Kay give a talk at the MIT Media Lab about how computers and the internet are transforming human communication, succeeding speech, writing, and the printing press before it as the primary means of pushing ideas around. By way of simile, he continued, if the computer is like writing, and the internet is like the printing press, we are living at a time after the printing press, and before the Enlightenment — when ubiquitous access to books in Europe (by at least the wealthy elite) made for a startling period of rapid intellectual advancement. His talk has always stirred strong feelings in me because it made me want to know what new ways the computer would be used for communication.

Then the other day, I realized that our hands might be very bad inputs for the computer and, by Kay’s simile, for human communication. Here’s what I mean. In broad strokes, a computer to you and me is a machine that makes pictures when we move a mouse or tap our keyboard or, more recently, just tap directly on the screen. So, in a very real way, computers are machines which convert manual dexterity into pictures. As an alternative, consider a machine that turns our voices into pictures.

How would a computer like this work? Well, besides a microphone to record your vocalizations, it would need a way to locate where on the screen you are looking. This technology, called gaze tracking, works fairly well at this point using a special box that sits underneath your screen and looks up at your face and eyes. Using gaze tracking and a microphone you could actually simulate much of what the mouse does using your voice. The gaze tracker would track your eye position on the computer screen, and then when you speak or sing or vocalize, the interpretation is that you spoke to the screen at the location where you were looking.
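As a rough sketch of that interpretation, here is how an utterance might be attached to whatever sits under the gaze point. Everything in it is an assumption made for illustration: the `Target` rectangles, the `VoiceEvent` record, and the `locate_utterance` function are hypothetical names, and a real gaze tracker and speech recognizer would supply the coordinates and the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Target:
    """A rectangular on-screen element, in pixel coordinates (hypothetical)."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, gx: float, gy: float) -> bool:
        """True if the gaze point (gx, gy) falls inside this rectangle."""
        return (self.x <= gx < self.x + self.width
                and self.y <= gy < self.y + self.height)


@dataclass(frozen=True)
class VoiceEvent:
    """An utterance paired with the place the speaker was looking."""
    utterance: str
    gaze_x: float
    gaze_y: float
    target: Optional[str]  # name of the element under the gaze, if any


def locate_utterance(utterance: str,
                     gaze: Tuple[float, float],
                     targets: Sequence[Target]) -> VoiceEvent:
    """Interpret an utterance as spoken "at" the current gaze position."""
    gx, gy = gaze
    # The first target containing the gaze point wins; None means empty screen.
    hit = next((t.name for t in targets if t.contains(gx, gy)), None)
    return VoiceEvent(utterance, gx, gy, hit)


# Made-up screen layout and gaze sample, purely for illustration.
targets = [
    Target("browser_icon", 0, 0, 64, 64),
    Target("search_box", 200, 100, 400, 30),
]
event = locate_utterance("open", (30.0, 40.0), targets)
print(event.target)  # browser_icon
```

The gaze point plays the role of the mouse cursor here, and the utterance plays the role of the click.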

For example, to surf the internet, you would turn on your computer (perhaps with your voice!), look at the Google application icon and say “open”, and then look at the search box and say “search for plant nurseries”. Upon seeing the results you might scan them with your eyes and fixate on the result third from the bottom, saying “open”. So your gaze serves to locate where you are speaking and also helps disambiguate what you want to do. But to me, the speech recognition part is not the interesting part, and it’s still not really using voice to make pictures… not really. To me it’s using voice to do some poor imitation of writing.

A more substantial departure from manual input is to use your voice in all of its wonderful tonal complexity (its timbre and pitch and volume) to literally make pictures. For example, you could paint with your voice by looking at a particular part of the screen and singing while gradually moving your eyes across the screen, modulating your voice as you go. The timbre of your voice would modulate the brush tip, the pitch could modulate the paint color, and the volume could modulate the brush size.
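To make that mapping a little more concrete, here is a minimal sketch of one way it could go. It assumes the audio analysis has already happened somewhere else and hands over three numbers: a pitch in hertz, a volume between 0 and 1, and a single "brightness" value between 0 and 1 standing in for timbre. The pitch range, the brush size range, the log-scaled hue, and all of the names (`BrushState`, `voice_to_brush`) are my own guesses for illustration, not a tested design.

```python
import colorsys
import math
from dataclasses import dataclass
from typing import Tuple

# Assumed ranges; a real system would calibrate these per singer.
MIN_PITCH_HZ = 80.0
MAX_PITCH_HZ = 1000.0
MIN_SIZE_PX = 2.0
MAX_SIZE_PX = 40.0
MAX_HUE = 0.75  # stop at violet so the lowest and highest notes differ in color


@dataclass(frozen=True)
class BrushState:
    color: Tuple[int, int, int]  # (r, g, b), each 0..255
    size: float                  # brush diameter in pixels
    hardness: float              # 0 = soft tip, 1 = hard tip


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def voice_to_brush(pitch_hz: float, volume: float, brightness: float) -> BrushState:
    """Map pitch to paint color, volume to brush size, timbre to brush tip."""
    # Pitch -> hue on a log scale, so each octave covers an equal span of color.
    pitch = _clamp(pitch_hz, MIN_PITCH_HZ, MAX_PITCH_HZ)
    fraction = math.log2(pitch / MIN_PITCH_HZ) / math.log2(MAX_PITCH_HZ / MIN_PITCH_HZ)
    r, g, b = colorsys.hsv_to_rgb(fraction * MAX_HUE, 1.0, 1.0)
    color = (round(r * 255), round(g * 255), round(b * 255))

    # Volume -> brush size, linear between the assumed minimum and maximum.
    v = _clamp(volume, 0.0, 1.0)
    size = MIN_SIZE_PX + v * (MAX_SIZE_PX - MIN_SIZE_PX)

    # Timbre (reduced to one brightness number) -> tip hardness.
    hardness = _clamp(brightness, 0.0, 1.0)
    return BrushState(color, size, hardness)


# A quiet, low, dull note gives a small, soft, red brush.
print(voice_to_brush(80.0, 0.0, 0.0))
```

Each new gaze sample would then stamp a brush with this state at the point you are looking at, so moving your eyes while singing drags the stroke across the screen.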

You see, one amazing thing about your voice is that it can transmit information as sound waves beyond your body. This mirrors in some sense the amazing thing about your eyes, namely that they can receive input from beyond your body in the form of light waves. Your hands are not as good at transmitting or receiving information past your body, although you can learn to make and play instruments which accomplish the same task that in some sense your voice does “out of the box”. So there is this very natural affinity between our ability to make sounds with our voices and transmit them as waves and our desire to see information as pictures. The translation between these can be done with the computer.

More about this later. I hope to do some real experiments and report back here.

3 COMMENTS
ALec
March 23, 2011

I couldn’t find a good link right at hand, but Karan Singh, Andy Nealen and others work on a similar (perhaps orthogonal) problem of defining and providing what they call “Human video-out”. The way they usually put it is similar to what you write. We humans have audio-in and audio-out: we can receive and emit sound waves. But for video we can only receive. We have a video-in, but no implicit video-out.

panda
March 24, 2011

I love this. run with it.
can’t wait to paint with sound!

panda
March 24, 2011

I just saw this, and it made me wonder what kinds of surfaces one could interact with using sound. Buildings? http://www.thecoolhunter.net/article/detail/1913/adidas-france–raising-the-bar-in-3d-mapping-projection
