Is it possible to convert video information to sound in order to help those suffering with visual impairment?
Oliver Britton EPQ project
This is an interactive document exploring different approaches to using "sensory substitution" to convey visual information to those suffering from blindness. This involves identifying some key requirements and limits for the technology, looking at existing approaches to the problem and presenting some new approaches for consideration. You may be asked to enable your webcam in order for some of the interactive examples to work.
Conversions
This is the main artefact of the project. Here you can try out
different approaches to converting your webcam's video feed to sound,
and see a visualisation of the generated sound wave on the left and
the intermediate processing steps on the right. The sound wave
visualisation consists of the waveform (the black line) and the
frequency spectrum (the orange lines).
You can toggle generating sound by clicking on the waveform.
Explanation
This is the 'mean RGB brightness harmonics' conversion scheme. Here, the average brightness values of the red,
green and blue parts of the image are calculated and then these
brightness values are mapped to a frequency range. In this example, red
goes from 100Hz to 500Hz, green from 500Hz to 1000Hz and then blue from
1000Hz to 1500Hz, which correspond to the three distinct groups that can
appear on the waveform.
This means that you can hear how much red, green and blue are in an
image. To do this, listen carefully for one of the "notes" in the
overall "chord". If you hold up a blue object in front of the webcam,
then you'll hear one of the notes getting louder, while the others get
quieter.
One application of this could be for colour blind people using the London Underground. Although this demonstration covers the whole field of view, a future device could connect to a phone and only generate sounds for whatever is currently in focus, so that as a user traced the tube lines with their eyes, they would hear a faint auditory indicator telling them what colour each line is.
Introduction
Converting images to sound is an example of sensory substitution, where information that would normally come through one sense is conveyed through another sense instead. This is normally done because the original sense is impaired, which is why sensory substitution's main use case is developing technology for disabled people. Braille is probably the most famous form of sensory substitution, restoring the ability to read through tactile stimulation.
Sensory substitution devices consist of three components (Lenay et al., 2003):
A sensor, which records the information that would normally be coming to the sense being replaced,
A coupling system, which converts the information from the sensor into something that can be
interpreted by the new sense, and
A stimulator, which delivers the information to the new sense.
This essay explores the coupling system for a sensory substitution system with a camera as the sensor and
headphones or speakers as the stimulator. The coupling system here is interpreted as a conversion algorithm
which converts representations of images into representations of sounds.
How do you represent images and sounds?
In order to create an algorithm for converting between images and sounds, we need a standard digital
representation that allows for efficient manipulation. There are many different ways of
representing images digitally, but the easiest format to work with is a standard 24-bit bitmap image. In this
representation, an image is stored as a list of pixels ("picture elements") which contain three values
corresponding to the proportion of red, green and blue at a certain point in an image (Leonard, 2016). Each colour is allocated 8 bits, which allows the numbers 0 through to 255 to be represented, where 0 is none of the respective colour and 255 is the maximum possible amount of that colour. For example:
\((255, 0, 0)\), only red.
\((211, 6, 184)\), lots of red and blue but not much green.
\((0,0,0)\), completely black.
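As a rough sketch of this representation (using the same image.pixels[x][y] layout as the pseudocode later in this essay; the exact structure is an assumption for illustration), a tiny two-pixel image might be stored like this:
let image = {
  width: 2,
  height: 1,
  // pixels[x][y] holds the 8-bit red, green and blue values (0 to 255)
  // for the pixel at column x and row y.
  pixels: [
    [ { r: 255, g: 0, b: 0 } ],   // (255, 0, 0): only red
    [ { r: 211, g: 6, b: 184 } ]  // (211, 6, 184): lots of red and blue
  ]
};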
Sounds, on the other hand, are a little trickier to represent. Because sounds are analogue as opposed to digital, it's hard to describe the exact shape of a sound wave on a computer. One way around this is to leverage the fact that every sound can be represented as a sum of sine waves, and store the constituent sine waves that make up the sound up to a cutoff frequency.
Sums of sine waves
Mathematically, the process of converting a sound into the sum of sine waves is known as a Fourier transform
(Swanson, n.d.),
named after the French mathematician Joseph Fourier. Given a function \(f(t)\) (corresponding to the acoustic pressure/amplitude of a sound at different points in time), the Fourier transform is:
$$
\hat{f}(\xi) = \int^\infty_{-\infty} f(t) e^{-2\pi i t \xi} \text{d}t
$$
Giving the function \(\hat{f}\) an input of \(\xi\) will return the magnitude of how much of that frequency is present in the sound. Computing this integral analytically is difficult, but as we are working with digital sounds and don't have a continuous signal, we can use the "discrete Fourier transform", which can be computed directly from the sampled values.
The Fourier transform comes up again when describing the Histogram Harmonics
conversion algorithm, where it is used in reverse to convert an angle-frequency function into a
frequency-amplitude function.
This means that we can represent a sound as a list of sine waves, just as we can represent an image as a list of pixels.
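As a rough sketch of what this representation allows (the sample rate and the particular sine waves here are illustrative assumptions), a sound stored as a list of sine waves can be turned back into samples by simply adding the waves together:
// A sound represented as a list of constituent sine waves.
let sineWaves = [
  { frequency: 440, amplitude: 0.5 },  // A4
  { frequency: 880, amplitude: 0.25 }  // one octave higher, quieter
];

let sampleRate = 44100; // samples per second (an assumed value)

// Reconstruct one second of audio by summing the sine waves.
let samples = [];
for (let n = 0; n < sampleRate; n++) {
  let t = n / sampleRate;
  let value = 0;
  for (let wave of sineWaves) {
    value += wave.amplitude * Math.sin(2 * Math.PI * wave.frequency * t);
  }
  samples.push(value);
}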
Limits on this approach
There is a fundamental limit on using sensory substitution to convert images to sound: you can fit much more information through the eye than you can through the ear in the same amount of time. Current estimates suggest that the human eye can transfer roughly ten million bits of data to the brain per second (Koch et al., 2006), whereas the ear can transfer a maximum of one hundred thousand bits per second (Markowsky, 2019). For this reason, there has to be a large trade-off where the majority of the fine detail in an image is removed in order to save bandwidth.
It is also worth considering the abilities of a computer when converting the images into sound. While a computer can operate on vastly larger amounts of data than the conscious human brain (Markowsky, 2019), a potential conversion algorithm could be so complex that it couldn't keep up on relatively small hardware (which may be necessary for any real-world application, as hauling around a large computer wouldn't be very practical). Moreover, computers present an additional reason to reduce the detail in the image: the cost of looping through each pixel individually grows quadratically with the side length of the image.
Overall, two trade-offs need to be made for this approach to work. The first is that we need to reduce the sheer amount of data (Dimensionality Reduction) going into a conversion algorithm, and the second is to ensure that the procedure for converting images is simple enough that a computer can perform it quickly and efficiently (Computational Efficiency). Furthermore, we also need to ensure that the data is human-interpretable (Human-interpretability) so that a person using a device can understand what they're hearing.
Dimensionality Reduction
Dimensionality reduction is the process of shrinking down data in a way that still preserves useful properties
despite the reduced size (Box et al., 2009). For example, sounds are compressed on a computer so that they still
have the useful
properties that
humans expect (pitch,
rhythm, timbre) but don't take up as much space.
This is important because it both reduces the amount of work that the computer needs to do and makes it easier for the human to understand: there's no point sacrificing the clarity of the important details of an image in order to make sure the irrelevant ones are also preserved.
But how do we do this in practice? Here are a few approaches you could take:
Reducing the resolution of the image. While it would be appealing to have as sharp an image as possible, the data transfer rate of the ear compared to the eye means we need to make big sacrifices in the resolution of the image. Reducing the resolution is like taking the photo with a blurrier camera: if we originally had an HD photo of \(1920 \times 1080\) pixels, we could get away with converting just a \(240 \times 135\) image and still be able to recognise what it shows.
Giving less space to colours. The representation for images presented above gave a range of 0 to 255 for the red, green and blue components, represented by 8 bits each. If instead we only gave 4 bits to each component, we could represent 0 to 15. The image would be much poorer quality but also much smaller.
Get rid of colours entirely. Even more drastically, we could get rid of colours entirely and have each pixel show only the overall brightness at that point in the image. We'd then only need one value per pixel instead of three.
Only show where the edges are. Many images are recognisable from just the outlines of
shapes
alone. This gives us another opportunity for drastically reducing the size of the image.
Splitting the image up. Another approach would be to run the image conversion scheme on separate parts of the image and play the sounds one after another. This would reduce the rate at which a device could react to a new image, but would mean that a lot more detail could be carried across, just over a larger amount of time. For example, considering each vertical strip of the image separately one after another could turn the input from a \(240 \times 135\) image into just a \(1 \times 135\) column.
Using visual descriptors. When training object detection algorithms, a common approach is to use a series of "visual descriptors" rather than the raw image itself. For example, an algorithm might convert an image into a histogram of oriented gradients (a "HOG") which will be similar for images of the same object. This means you can massively compress images down from millions of pixels to a few thousand visual descriptors.
Typically, a conversion algorithm will use several steps of dimensionality reduction (Meijer, 1992), such as reducing the resolution of the image, using only brightness values and splitting the image into strips, in order to reduce the work for both the human and the computer as much as possible.
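As a sketch of what such a pipeline might look like (downscale is an assumed helper and the \(240 \times 135\) target size is only illustrative, not taken from any particular device):
// Illustrative dimensionality-reduction pipeline: downscale the image,
// keep only brightness, then split it into vertical strips.
function reduceImage(image) {
  // 1. Reduce the resolution, e.g. from 1920x1080 down to 240x135.
  let small = downscale(image, 240, 135); // assumed helper

  // 2. Throw away colour: keep one brightness value per pixel.
  let grey = [];
  for (let x = 0; x < small.width; x++) {
    grey.push([]);
    for (let y = 0; y < small.height; y++) {
      let p = small.pixels[x][y];
      grey[x].push((p.r + p.g + p.b) / 3);
    }
  }

  // 3. Split into vertical strips: grey is indexed by column,
  //    so each grey[x] is one strip to be converted in turn.
  return grey;
}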
Human-interpretability
One careful consideration needs to be made for whichever techniques of dimensionality reduction are used: the data still needs to be interpretable by humans. A computer might be able to make sense of an extremely concise representation of an image generated by an autoencoding algorithm, but a human listener almost certainly could not.
Modern compression algorithms can reduce the size of an image by a large factor without sacrificing any quality, in a procedure known as "lossless" compression. This can involve techniques like identifying redundant or repeated data, or shortening the amount of data required to describe a small section of an image (Kiran, Mounika and Srinivas, 2018).
Lossless compression algorithms cannot be used by sensory substitution devices because the data that comes out of these algorithms is not human-interpretable at rest. When a computer needs to display the image, it first has to decompress it so that it can be shown on the screen. This means that while compression is useful for communicating between computers, it's not useful for communicating from a computer to a human.
Care also needs to be taken that the method used for converting the image to sound (rather than the method for reducing the dimensionality) is human-interpretable. To do this, the features of the image that once corresponded to brightness or colour need to be mapped onto features of sound. A few ways this could be done include:
Brightness \(\to\) Amplitude (loudness)
Colour \(\to\) Frequency (pitch)
Edge orientations \(\to\) Timbre (how an instrument sounds)
However, this can still go wrong:
Changes in amplitude might be so small that there is no discernible difference between two sounds that are supposed to convey different information.
Frequencies far outside the range of human hearing (\(20\text{Hz}\) to \(20\text{kHz}\)) (Purves et al., 2018) may accidentally be used. Although this would mean there was more information in the sound, it would be useless to a human as they wouldn't be able to hear it.
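A minimal sketch of such a mapping (the particular frequency band is an illustrative assumption) might clamp every generated frequency to the audible range so the second problem can never occur:
// Map image features onto sound features, keeping everything audible.
// The specific ranges here are assumptions for illustration only.
const AUDIBLE_MIN = 20;    // Hz
const AUDIBLE_MAX = 20000; // Hz

function brightnessToAmplitude(brightness) {
  // brightness in [0, 1] maps directly to loudness in [0, 1].
  return brightness;
}

function colourToFrequency(hue) {
  // hue in [0, 1] maps linearly onto a band of pitches...
  let freq = 200 + hue * (2000 - 200);
  // ...and is clamped so it can never leave the audible range.
  return Math.min(Math.max(freq, AUDIBLE_MIN), AUDIBLE_MAX);
}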
Simplicity
An important part of designing a conversion algorithm is making sure that it doesn't introduce any unnecessary complexity that will make it harder for a human to understand. Even if you were to use features of sound that a human could recognise, if the process of turning the image into sound included unnecessary steps, it would put a larger burden on the person using the technology to understand and interpret what they are hearing. This links back to why using a lossless compression algorithm would not be a good idea: the information is encoded using a complicated set of mathematical and computational operations that a human would struggle to perform manually in a short amount of time.
Small changes should lead to small changes
An important part of human-interpretability is making sure that the conversion isn't too sensitive to small changes in the input. In other words, making a small modification to the image shouldn't have a large effect on the sound that gets generated. If it does, that's a good indicator that the conversion algorithm will be too complex to be understood.
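As an illustration of the difference (these two encodings are hypothetical sketches, not the ones used in the interactive example below), compare a hash-like encoding with a simple mean-brightness encoding:
// "Sensitive": a hash-like encoding where every pixel scrambles the
// result, so changing a single pixel can change the output completely.
function sensitiveEncoding(image) {
  let value = 0;
  for (let x = 0; x < image.width; x++) {
    for (let y = 0; y < image.height; y++) {
      let p = image.pixels[x][y];
      value = (value * 31 + p.r + p.g + p.b) % 997;
    }
  }
  return value;
}

// "Insensitive": the mean brightness, where changing a single pixel
// only nudges the output by a tiny amount.
function insensitiveEncoding(image) {
  let total = 0;
  for (let x = 0; x < image.width; x++) {
    for (let y = 0; y < image.height; y++) {
      let p = image.pixels[x][y];
      total += (p.r + p.g + p.b) / 3;
    }
  }
  return total / (image.width * image.height);
}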
Try out the difference between two different "encodings" of an image, one very sensitive to changes and one only moderately sensitive. Try drawing a picture and seeing how making a small change to the image affects the two different values. The first, "sensitive" number will jump about even when you make a small change to the image, whereas the second, "insensitive" number will only change slightly.
Existing Technologies
So far, a few requirements and limits on how a system should convert images to sound have been identified. Before presenting a few new techniques for consideration, it is worthwhile to look at some technologies that already exist.
The vOICe
The
vOICe (oh-I-see) describes itself as "augmented reality for the totally blind" (Meijer, n.d.) and has
had real-world
use since its conception in 1992. Not only has the project developed accessible and open-source technology for
this purpose but the organisation has published extensive research and literature reviews about the topic.
The algorithm at the core of their system is summarised by the following diagram:
The dimensionality of the video data is reduced by converting the image to black and white, down-scaling it and splitting it into vertical strips. Each vertical component of the image is then assigned a sine wave of a particular frequency whose amplitude is modulated by the relative brightness of that particular pixel. Each column is then converted and played in sequence (Meijer, 1992).
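A very rough sketch of this idea (an illustration only, not the vOICe's actual implementation; playColumn and the frequency mapping are assumptions) might look like this:
// Column-scanning sonification in the style described above.
function sonifyFrame(grey) { // grey[x][y] = brightness in [0, 1]
  for (let x = 0; x < grey.length; x++) {
    let waves = [];
    for (let y = 0; y < grey[x].length; y++) {
      // Higher rows get higher pitches; brightness sets the loudness.
      let frequency = 500 + (grey[x].length - 1 - y) * 40; // assumed mapping
      waves.push({ frequency: frequency, amplitude: grey[x][y] });
    }
    playColumn(waves); // assumed helper: plays the column's waves briefly
  }
}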
This conversion scheme makes the trade-off of increasing the amount of detail carried across from an image by increasing the amount of time it takes per frame. However, as a user gets more advanced at identifying objects, they can increase the speed of transmission to a level they find comfortable. This technique of allowing customisation makes the vOICe one of the most practically useful and adoptable technologies of its kind. Users of the vOICe can navigate urban environments, exercise outdoors and identify specific objects like a dropped cane or book.
TapTapSee
TapTapSee is an open-source mobile app that takes a different approach to the vOICe: it offloads the work of recognising specific objects in a scene to a computer and then has these reported back to the user via speech (Taptapseeapp.com, 2019). In this example, a blind user has pointed their phone at a flowerpot and wishes to know what type of flower it is.
By leveraging an external image recognition API, the app is able to recognise that it is a "dieffenbachia"/"dumb cane" and report it back via text-to-speech. While not as broadly applicable to everyday life and navigation as the vOICe, this approach has the advantage of handling complex objects and even text that a system like the vOICe would struggle with.
Conversion Algorithms
This section outlines some potential algorithms for converting images to sound. While not as practically useful or
complicated as the algorithms present in projects like the vOICe or TapTapSee, they could provide an
interesting starting point for further research and consideration. All of the algorithms described below are
available to try interactively at the top of this document.
Brightness to Frequency
Perhaps the simplest of all possible algorithms, the brightness to frequency procedure could be summarised as
follows:
Take the input image
Reduce its width and height
Find the mean value of the RGB components across all pixels
Create a sine wave with a low frequency for a low mean and a high frequency for a high mean
This creates a mapping between how bright the overall image is and how high the frequency of the produced sound is, allowing you to "hear brightness".
In this example, the input image is a photo of a Rubik's cube on a lined sheet of paper:
As you can see, the process of calculating the overall brightness removes any colour present in the image. Specifically, the overall intensity of the image is detected to be \(0.61\) out of \(1\), and so a sine wave is created 61% of the way through the interval \([200\text{Hz}, 1000\text{Hz}]\) (a range of just over two octaves), giving a tone of roughly \(690\text{Hz}\), close to F5 on a piano. In pseudocode, the algorithm might look like this:
let totalBrightness = 0;
for (let y = 0; y < image.height; y++) {
  for (let x = 0; x < image.width; x++) {
    totalBrightness += image.pixels[x][y].r;
    totalBrightness += image.pixels[x][y].g;
    totalBrightness += image.pixels[x][y].b;
  }
}
playFreq(
  map(
    totalBrightness,
    // Total possible brightness would be if all
    // pixels were 255 in R, G and B:
    0, image.width * image.height * 255 * 3,
    // Convert to between 200Hz and 1000Hz.
    200, 1000
  )
)
RGB to Frequencies
The idea of converting intensity to pitch can be extended by also accounting for the individual brightness of the red, green and blue components of an image. This is done in a similar fashion to the monochromatic case:
Take the input image
Reduce its width and height
Find the mean value of the red component of each pixel
Find the mean value of the green component of each pixel
Find the mean value of the blue component of each pixel
Find the ratio of red to green to blue across the image
Create three sine waves, with the amplitude and frequency of each corresponding to this ratio and brightness
For a practical example of this algorithm in action, here is the example of a Rubik's cube again but this time
with each colour channel treated separately:
The additional colours around the outside are an artefact of the fact that white contains equal amounts of red, green and blue. If a white object, like the sheet of paper in the above image, is ever so slightly more red than it is green or blue, it will be treated as red in the visualisation. While this appears to have a drastic effect in the above image, the outputted sound will not be significantly affected, as all colours are distorted by roughly the same amount, which only leads to a distortion in the calculated amplitude of each sine wave rather than in the frequency (which is much easier to identify).
In the final sound, there will be three groups that will show up on a spectrogram:
The black line represents the overall wave formed from the superposition of the harmonics, and the three vertical columns represent the red, green and blue channels respectively. This technique would allow a visually impaired person to "hear" colour, such as being able to discern what type of fruit something is without having to feel it. In pseudocode, it might look like this:
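// A sketch following the style of the earlier example; playFreqWithAmp is
// an assumed helper that plays a sine wave at a given frequency and amplitude.
let totalR = 0;
let totalG = 0;
let totalB = 0;
for (let y = 0; y < image.height; y++) {
  for (let x = 0; x < image.width; x++) {
    totalR += image.pixels[x][y].r;
    totalG += image.pixels[x][y].g;
    totalB += image.pixels[x][y].b;
  }
}
let total = totalR + totalG + totalB;
let maxPerChannel = image.width * image.height * 255;
// Each channel gets its own frequency band (as in the interactive example:
// red 100-500Hz, green 500-1000Hz, blue 1000-1500Hz), with loudness set by
// that channel's share of the overall brightness.
playFreqWithAmp(map(totalR, 0, maxPerChannel, 100, 500), totalR / total);
playFreqWithAmp(map(totalG, 0, maxPerChannel, 500, 1000), totalG / total);
playFreqWithAmp(map(totalB, 0, maxPerChannel, 1000, 1500), totalB / total);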
Histogram Harmonics
While using the red, green and blue channels to form harmonics allows a visually impaired person to hear colour, this conversion algorithm lets them hear orientations instead. The process looks like this:
Take input image
Reduce width and height
Apply a Sobel filter to the image
Use finite differences to calculate an estimate for edge orientation
Create a histogram of oriented gradients (HOG), binning pixels in the image according to their edge orientation
Run an Inverse Discrete Fourier Transform (IDFT) on this histogram to create a family of sine waves
Informally, it's like looking at the angles of all the edges in an image and then plotting a histogram showing which angles are the most common. This distribution is then used to create sine waves, so encoded in each is information about which way the edges are facing. In some ways, this technique is less useful than the colour harmonics conversion scheme. Here is the algorithm run on an image of a Rubik's cube:
Due to the application of the Sobel filter (which is used for detecting edges), the details of the cube itself have become almost entirely unrecognisable. Here, each white line represents the calculated orientation of an edge. However, on some images, this is more useful. Here is an image of a flat, horizontal black line:
Using the colour harmonics approach, this image would be indistinguishable from this image:
However, there is a clear difference in the two images' resulting sounds when run through this algorithm, as can be seen from these two spectrograms:
The pseudocode for this algorithm is more complex than for the other two, but could be summarised like so:
applySobelFilter(image)
let total = 0;
let bins = [...]; // one bin per range of edge angles, all starting at zero
for (let y = 0; y < image.height - 1; y++) {
  for (let x = 0; x < image.width - 1; x++) {
    let brightness = image.pixels[x][y].brightness
    // Finite differences: how much the brightness changes in x and y.
    let finiteDiffX = image.pixels[x + 1][y].brightness - brightness
    let finiteDiffY = image.pixels[x][y + 1].brightness - brightness
    let vector = Vec([finiteDiffX, finiteDiffY])
    let angle = vector.angle()
    if (angle > π/2) {
      angle -= π/2
    }
    addToBins(bins, angle)
    total += 1
  }
}
frequencies = idft(bins)
playFrequencies(frequencies)
The Sobel filter is a special image operation akin to a blur or an increase in contrast that is used before
detecting edges (Fisher et al., 2003). It looks like this:
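Concretely, a standard Sobel filter slides a pair of \(3 \times 3\) kernels over the image to estimate the brightness gradient at each pixel. These are the textbook kernels given by Fisher et al. (2003); whether the artefact uses exactly this form is an assumption:
// Standard 3x3 Sobel kernels for estimating the horizontal (Gx) and
// vertical (Gy) brightness gradients at each pixel.
const sobelX = [
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1]
];
const sobelY = [
  [-1, -2, -1],
  [ 0,  0,  0],
  [ 1,  2,  1]
];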
QR Encoding
The final conversion algorithm to highlight here is "QR encoding". Like TapTapSee, this algorithm chooses to offload the work of identifying what's in the image to the computer rather than the user. While TapTapSee works by identifying any object, the QR encoding algorithm only recognises QR codes that encode information relevant to the visually impaired. For example, textured QR codes could accompany braille and give the visually impaired easier access to information such as health and safety warnings or public transport information.
To try this out interactively, select "encoding information in qr codes" from the dropdown at the top of the document. This version of the algorithm treats QR codes as corresponding to pitches, and a tool for generating your own QR codes to try this out is available here. Here are the QR codes that represent the notes A & C:
While QR codes representing notes on a scale wouldn't be that useful, a QR code could represent anything. Here is
a QR code that will play a short recording of a documentary about the Tower of Babel when shown to the camera:
One problem with this approach is that QR codes can only contain a very small amount of data (around 24,000 bits)
(www.qrcode.com, n.d.)
and so it would not always be possible to put the actual information inside the QR code itself. Instead, the QR
codes would need to act as something like web addresses which the camera would use to look up the content from a
central server. In the Tower of Babel QR code example above, the only information contained is the word "babel"
which the software recognises as an instruction to play the clip from the documentary.
This approach would mean that much more data could be represented with a QR code. There would be around
\(2^{24,000}\)
possible addresses for use.
In very abstract pseudocode, the algorithm would look like the following:
let { qrCode, exists } = video.scanForQRCodes();
if (exists) {
  let data = lookupQRCode(qrCode); // e.g. fetch the content for this code
  playSound(data);
} else {
  // No QR code in this frame; keep scanning.
}
Conclusion
In conclusion, there exist many different ways that you can convert video information to sound. This essay has
identified a few requirements and limits that must be placed on the algorithms themselves and presented a few
new techniques that could be considered for the future.
Each method for converting video information has its own advantages and disadvantages. Perhaps a future tool
will synthesise all these paradigms
together into a
device that allows the visually impaired to use a variety of conversion techniques in order to extract as
much information from the visual world as they can.
Making this essay
It was my original intention for this essay to be around 2,000 words, but after realising that providing enough context for a discussion of the algorithms implemented in the artefact would be tricky, I changed the limit to 5,000 words. There are a lot of things I would do differently if I could start again:
Create a more detailed plan at the beginning. This would mean fewer headaches trying not to exceed the word limit and would have made the essay easier to write.
Organise my references better. After re-reading a first draft, I realised I'd failed to cite quite a few of my sources. If I'd kept better track of my references from the beginning, it would have made the process of adding the correct attributions afterwards much easier.
Pick a better format than a webpage. Making the essay interactive has meant that it's
easier
to illustrate points using examples, but has made it harder to write and share at the end of the process.
Narrow the scope of my question. If I were to start from scratch again, I would further narrow my title so as to limit the scope of what I felt I had to write about.
Copyrights and Licenses
Several pieces of free and open-source software were used in order to create this document.
References
Box, P., Van Der Maaten, L., Postma, E. and Van Den Herik, J. (2009). Dimensionality Reduction: A Comparative Review.
Fisher, R., Perkins, S., Walker, A. and Wolfart, E. (2003). Feature Detectors - Sobel Edge Detector. [online] homepages.inf.ed.ac.uk. Available at: https://homepages.inf.ed.ac.uk/rbf/HIPR2/sobel.htm.
Kiran, A., Mounika, M. and Srinivas, B. (2018). Modern Lossless Compression Techniques: Review, Comparison and Analysis.
Koch, K., McLean, J., Segev, R., Freed, M.A., Berry, M.J., Balasubramanian, V. and Sterling, P. (2006). How Much the Eye Tells the Brain. Current Biology, 16(14), pp.1428–1434.
Lenay, C., Gapenne, O., Hanneton, S., Marque, C. and Genouëlle, C. (2003). Chapter 16. Sensory substitution. Touching for Knowing, pp.275–292.
Leonard, S. (2016). Windows Image Media Types.
Markowsky, G. (2019). Information theory - Physiology | Britannica. In: Encyclopædia Britannica. [online] Available at: https://www.britannica.com/science/information-theory/Physiology.
Meijer, P. (n.d.). The vOICe - New Frontiers in Artificial Vision. [online] www.seeingwithsound.com. Available at: https://www.seeingwithsound.com/.
Meijer, P.B.L. (1992). An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2), pp.112–121.
Purves, D., Augustine, G.J., Fitzpatrick, D., Katz, L.C., Anthony-Samuel LaMantia, McNamara, J.O. and S Mark Williams (2018). The Audible Spectrum. [online] Nih.gov. Available at: https://www.ncbi.nlm.nih.gov/books/NBK10924/.
Swanson, J. (n.d.). An Interactive Introduction to Fourier Transforms. [online] www.jezzamon.com. Available at: https://www.jezzamon.com/fourier/ [Accessed 18 Feb. 2022].
Taptapseeapp.com. (2019). TapTapSee - Blind and Visually Impaired Assistive Technology - powered by CloudSight.ai Image Recognition API. [online] Available at: https://taptapseeapp.com/.
www.qrcode.com. (n.d.). Information capacity and versions of QR Code | QRcode.com | DENSO WAVE. [online] Available at: https://www.qrcode.com/en/about/version.html.