Is it possible to convert video information to sound in order to help those suffering with visual impairment?
Oliver Britton EPQ project
This is an interactive document exploring different approaches to using "sensory substitution" to convey visual information to those suffering from blindness. This involves identifying some key requirements and limits for the technology, looking at existing approaches to the problem and presenting some new approaches for consideration. You may be asked to enable your webcam in order for some of the interactive examples to work.
Conversions
This is the main artefact of the project. Here you can try out
different approaches to converting your webcam's video feed to sound,
and see a visualisation of the generated sound wave on the left and
the intermediate processing steps on the right. The sound wave
visualisation consists of the waveform (the black line) and the
frequency spectrum (the orange lines).
You can toggle generating sound by clicking on the waveform.
Explanation
This is the 'mean RGB brightness harmonics' conversion scheme. Here, the average brightness values of the red,
green and blue parts of the image are calculated and then these
brightness values are mapped to a frequency range. In this example, red
goes from 100Hz to 500Hz, green from 500Hz to 1000Hz and then blue from
1000Hz to 1500Hz, which correspond to the three distinct groups that can
appear on the waveform.
This means that you can hear how much red, green and blue are in an
image. To do this, listen carefully for one of the "notes" in the
overall "chord". If you hold up a blue object in front of the webcam,
then you'll hear one of the notes getting louder, while the others get
quieter.
One application of this could be for colour blind people using the London Underground. Although this demonstration covers the whole field of view, a future device could connect to a phone and only generate sounds for whatever is currently in focus, so that as a user traced the tube lines with their eyes, they would hear a faint auditory indicator telling them what colour each line is.
Introduction
Converting images to sound is an example of sensory substitution, where information that would normally come through one sense is conveyed through another sense instead. This is normally done because the original sense is impaired, which is why sensory substitution's main use case is developing technology for disabled people. Braille is probably the most famous form of sensory substitution, restoring the ability to read through tactile stimulation.
Sensory substitution devices consist of three components (Lenay et al., 2003):
A sensor, which records the information that would normally be coming to the sense being replaced,
A coupling system, which converts the information from the sensor into something that can be
interpreted by the new sense, and
A stimulator, which delivers the information to the new sense.
This essay explores the coupling system for a sensory substitution system with a camera as the sensor and
headphones or speakers as the stimulator. The coupling system here is interpreted as a conversion algorithm
which converts representations of images into representations of sounds.
How do you represent images and sounds?
In order to create an algorithm for converting between images and sounds, we need a standard digital
representation that allows for efficient manipulation. There are many different ways of
representing images digitally, but the easiest format to work with is a standard 24-bit bitmap image. In this
representation, an image is stored as a list of pixels ("picture elements") which contain three values
corresponding to the proportion of red, green and blue at a certain point in an image (Leonard, 2016). Each colour is allocated 8 bits, which allows the numbers 0 through to 255 to be represented, where 0 is none of the respective colour and 255 is the maximum possible amount of that colour. For example:
\((255, 0, 0)\), only red.
\((211, 6, 184)\), lots of red and blue but not much green.
\((0,0,0)\), completely black.
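As a rough sketch of this representation (using the same image.pixels[x][y] layout as the pseudocode later in this essay; the exact structure is an assumption for illustration), a tiny two-pixel image might be stored like this:
let image = {
  width: 2,
  height: 1,
  // pixels[x][y] holds the 8-bit red, green and blue values (0 to 255)
  // for the pixel at column x and row y.
  pixels: [
    [ { r: 255, g: 0, b: 0 } ],   // (255, 0, 0): only red
    [ { r: 211, g: 6, b: 184 } ]  // (211, 6, 184): lots of red and blue
  ]
};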
Sounds, on the other hand, are a little trickier to represent. Because sounds are analogue as opposed to digital, it's hard to describe the exact shape of a sound wave on a computer. One way around this is to leverage the fact that every sound can be represented as a sum of sine waves, and store the constituent sine waves that make up the sound up to a cutoff frequency.
Sums of sine waves
Mathematically, the process of converting a sound into the sum of sine waves is known as a Fourier transform
(Swanson, n.d.),
named after the French mathematician Joseph Fourier. Given a function \(f(t)\) (corresponding to the acoustic pressure/amplitude of a sound at different points in time), the Fourier transform is:
$$
\hat{f}(\xi) = \int^\infty_{-\infty} f(t) e^{-2\pi i t \xi} \text{d}t
$$
Giving the function \(\hat{f}\) an input of \(\xi\) will return the magnitude of how much of that frequency is present in the sound. Computing this integral analytically is difficult, but as we are working with digital sounds and don't have a continuous signal, we can use the "discrete Fourier transform", which can be computed directly from the sampled values.
The Fourier transform comes up again when describing the Histogram Harmonics
conversion algorithm, where it is used in reverse to convert an angle-frequency function into a
frequency-amplitude function.
This means that we can represent a sound as a list of sine waves, just as we can represent an image as a list of pixels.
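As a rough sketch of what this representation allows (the sample rate and the particular sine waves here are illustrative assumptions), a sound stored as a list of sine waves can be turned back into samples by simply adding the waves together:
// A sound represented as a list of constituent sine waves.
let sineWaves = [
  { frequency: 440, amplitude: 0.5 },  // A4
  { frequency: 880, amplitude: 0.25 }  // one octave higher, quieter
];

let sampleRate = 44100; // samples per second (an assumed value)

// Reconstruct one second of audio by summing the sine waves.
let samples = [];
for (let n = 0; n < sampleRate; n++) {
  let t = n / sampleRate;
  let value = 0;
  for (let wave of sineWaves) {
    value += wave.amplitude * Math.sin(2 * Math.PI * wave.frequency * t);
  }
  samples.push(value);
}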
Limits on this approach
There is a fundamental limit on using sensory substitution to convert images to sound: you can fit much more information through the eye than you can through the ear in the same amount of time. Current estimates suggest that the human eye can transfer roughly ten million bits of data to the brain per second (Koch et al., 2006), whereas the ear can transfer a maximum of one hundred thousand bits per second (Markowsky, 2019). For this reason, there has to be a large trade-off where the majority of the fine detail in an image is removed in order to save bandwidth.
It is also worth considering the abilities of a computer when converting the images into sound. While a computer can operate on vastly larger amounts of data than the conscious human brain (Markowsky, 2019), a potential conversion algorithm could be so complex that it couldn't keep up on relatively small hardware (which may be necessary for any real-world application, as hauling around a large computer wouldn't be very practical). Moreover, computers present an additional reason to reduce the detail in the image: the cost of looping through each pixel individually grows quadratically with the side length of the image.
Overall, two trade-offs need to be made for this approach to work. The first is that we need to reduce the sheer amount of data (Dimensionality Reduction) going into a conversion algorithm, and the second is to ensure that the procedure for converting images is simple enough that a computer can perform it quickly and efficiently (Computational Efficiency). Furthermore, we also need to ensure that the data is human-interpretable (Human-interpretability) so that a person using a device can understand what they're hearing.
Dimensionality Reduction
Dimensionality reduction is the process of shrinking down data in a way that still preserves useful properties
despite the reduced size (Box et al., 2009). For example, sounds are compressed on a computer so that they still
have the useful
properties that
humans expect (pitch,
rhythm, timbre) but don't take up as much space.
This is important because it both reduces the amount of work that the computer needs to do and makes it easier for the human to understand: there's no point sacrificing the clarity of the important details of an image in order to make sure the irrelevant ones are also preserved.
But how do we do this in practice? Here are a few approaches you could take:
Reducing the resolution of the image. While it would be appealing to have as sharp an image as possible, the data transfer rate of the ear compared to the eye means we need to make big sacrifices in the resolution of the image. Reducing the resolution is like taking the photo with a blurrier camera: if we originally had an HD photo of \(1920 \times 1080\) pixels, we could get away with converting just a \(240 \times 135\) image and still be able to recognise what it shows.
Giving less space to colours. The representation for images presented above gave a range of 0 to 255 for the red, green and blue components, represented by 8 bits each. If instead we only gave 4 bits to each component, we could represent 0 to 15. The image would be much poorer quality but also much smaller.
Get rid of colours entirely. Even more drastically, we could get rid of colours entirely and have each pixel show only the overall brightness at that point in the image. We'd then only need one value per pixel instead of three.
Only show where the edges are. Many images are recognisable from just the outlines of
shapes
alone. This gives us another opportunity for drastically reducing the size of the image.
Splitting the image up. Another approach would be to run the image conversion scheme on separate parts of the image and play the sounds one after another. This would reduce the rate at which a device could react to a new image, but would mean that a lot more detail could be carried across, just over a larger amount of time. For example, considering each vertical strip of the image separately one after another could turn the input from a \(240 \times 135\) image into just a \(1 \times 135\) column.
Using visual descriptors. When training object detection algorithms, a common approach is to use a series of "visual descriptors" rather than the raw image itself. For example, an algorithm might convert an image into a histogram of oriented gradients (a "HOG") which will be similar for images of the same object. This means you can massively compress images down from millions of pixels to a few thousand visual descriptors.
Typically, a conversion algorithm will use several steps of dimensionality reduction (Meijer, 1992), such as reducing the resolution of the image, using only brightness values and splitting the image into strips, in order to reduce the work for both the human and the computer as much as possible.
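As a sketch of what such a pipeline might look like (downscale is an assumed helper and the \(240 \times 135\) target size is only illustrative, not taken from any particular device):
// Illustrative dimensionality-reduction pipeline: downscale the image,
// keep only brightness, then split it into vertical strips.
function reduceImage(image) {
  // 1. Reduce the resolution, e.g. from 1920x1080 down to 240x135.
  let small = downscale(image, 240, 135); // assumed helper

  // 2. Throw away colour: keep one brightness value per pixel.
  let grey = [];
  for (let x = 0; x < small.width; x++) {
    grey.push([]);
    for (let y = 0; y < small.height; y++) {
      let p = small.pixels[x][y];
      grey[x].push((p.r + p.g + p.b) / 3);
    }
  }

  // 3. Split into vertical strips: grey is indexed by column,
  //    so each grey[x] is one strip to be converted in turn.
  return grey;
}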
Human-interpretability
One careful consideration needs to be made for whichever techniques of dimensionality reduction are used: the data still needs to be interpretable by humans. A computer might be able to make sense of an extremely concise representation of an image generated by an autoencoding algorithm, but a human listener almost certainly could not.
Modern compression algorithms can reduce the size of an image by a large factor without sacrificing any quality, in a procedure known as "lossless" compression. This can involve techniques like identifying redundant or repeated data, or shortening the amount of data required to describe a small section of an image (Kiran, Mounika and Srinivas, 2018).
Lossless compression algorithms cannot be used by sensory substitution devices because the data that comes out of these algorithms is not human-interpretable at rest. When a computer needs to display the image, it first has to decompress it so that it can be shown on the screen. This means that while compression is useful for communicating between computers, it's not useful for communicating from a computer to a human.
Care also needs to be taken that the method used for converting the image to sound (rather than the method for reducing the dimensionality) is human-interpretable. To do this, the features of the image that once corresponded to brightness or colour need to be mapped onto features of sound. A few ways this could be done include:
Brightness \(\to\) Amplitude (loudness)
Colour \(\to\) Frequency (pitch)
Edge orientations \(\to\) Timbre (how an instrument sounds)
However, this can still go wrong:
Changes in amplitude might be so small that there is no discernible difference between two sounds that are supposed to convey different information.
Frequencies far outside the range of human hearing (\(20\text{Hz}\) to \(20\text{kHz}\)) (Purves et al., 2018) may accidentally be used. Although this would mean there was more information in the sound, it would be useless to a human as they wouldn't be able to hear it.
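A minimal sketch of such a mapping (the particular frequency band is an illustrative assumption) might clamp every generated frequency to the audible range so the second problem can never occur:
// Map image features onto sound features, keeping everything audible.
// The specific ranges here are assumptions for illustration only.
const AUDIBLE_MIN = 20;    // Hz
const AUDIBLE_MAX = 20000; // Hz

function brightnessToAmplitude(brightness) {
  // brightness in [0, 1] maps directly to loudness in [0, 1].
  return brightness;
}

function colourToFrequency(hue) {
  // hue in [0, 1] maps linearly onto a band of pitches...
  let freq = 200 + hue * (2000 - 200);
  // ...and is clamped so it can never leave the audible range.
  return Math.min(Math.max(freq, AUDIBLE_MIN), AUDIBLE_MAX);
}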
Simplicity
An important part of designing a conversion algorithm is making sure that it doesn't introduce any unnecessary complexity that will make it harder for a human to understand. Even if you were to use features of sound that a human could recognise, if the process of turning the image into sound included unnecessary steps, it would put a larger burden on the person using the technology to understand and interpret what they are hearing. This links back to why using a lossless compression algorithm would not be a good idea: the information is encoded using a complicated set of mathematical and computational operations that a human would struggle to perform manually in a short amount of time.
Small changes should lead to small changes
An important part of human-interpretability is making sure that the conversion isn't too sensitive to small changes in the input. In other words, making a small modification to the image shouldn't have a large effect on the sound that gets generated. If it does, that's a good indicator that the conversion algorithm will be too complex to be understood.
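As an illustration of the difference (these two encodings are hypothetical sketches, not the ones used in the interactive example below), compare a hash-like encoding with a simple mean-brightness encoding:
// "Sensitive": a hash-like encoding where every pixel scrambles the
// result, so changing a single pixel can change the output completely.
function sensitiveEncoding(image) {
  let value = 0;
  for (let x = 0; x < image.width; x++) {
    for (let y = 0; y < image.height; y++) {
      let p = image.pixels[x][y];
      value = (value * 31 + p.r + p.g + p.b) % 997;
    }
  }
  return value;
}

// "Insensitive": the mean brightness, where changing a single pixel
// only nudges the output by a tiny amount.
function insensitiveEncoding(image) {
  let total = 0;
  for (let x = 0; x < image.width; x++) {
    for (let y = 0; y < image.height; y++) {
      let p = image.pixels[x][y];
      total += (p.r + p.g + p.b) / 3;
    }
  }
  return total / (image.width * image.height);
}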
Try out the difference between two different "encodings" of an image, one very sensitive to changes and one only moderately sensitive. Try drawing a picture and seeing how making a small change to the image affects the two different values. The first, "sensitive" number will jump about even when you make a small change to the image, whereas the second, "insensitive" number will only change slightly.
Existing Technologies
So far, a few requirements and limits on how a system should convert images to sound have been identified. Before presenting a few new techniques for consideration, it is worthwhile to look at some technologies that already exist.
The vOICe
The
vOICe (oh-I-see) describes itself as "augmented reality for the totally blind" (Meijer, n.d.) and has
had real-world
use since its conception in 1992. Not only has the project developed accessible and open-source technology for
this purpose but the organisation has published extensive research and literature reviews about the topic.
The algorithm at the core of their system is summarised by the following diagram:
The dimensionality of the video data is reduced by converting the image to black and white, down-scaling it and splitting it into vertical strips. Each vertical component of the image is then assigned a sine wave of a particular frequency whose amplitude is modulated by the relative brightness of that particular pixel. Each column is then converted and played in sequence (Meijer, 1992).
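A very rough sketch of this idea (an illustration only, not the vOICe's actual implementation; playColumn and the frequency mapping are assumptions) might look like this:
// Column-scanning sonification in the style described above.
function sonifyFrame(grey) { // grey[x][y] = brightness in [0, 1]
  for (let x = 0; x < grey.length; x++) {
    let waves = [];
    for (let y = 0; y < grey[x].length; y++) {
      // Higher rows get higher pitches; brightness sets the loudness.
      let frequency = 500 + (grey[x].length - 1 - y) * 40; // assumed mapping
      waves.push({ frequency: frequency, amplitude: grey[x][y] });
    }
    playColumn(waves); // assumed helper: plays the column's waves briefly
  }
}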
This conversion scheme makes the trade-off of increasing the amount of detail carried across from an image by increasing the amount of time it takes per frame. However, as a user gets more advanced at identifying objects, they can increase the speed of transmission to a level they find comfortable. This technique of allowing customisation makes the vOICe one of the most practically useful and adoptable technologies of its kind. Users of the vOICe can navigate urban environments, exercise outdoors and identify specific objects like a dropped cane or book.
TapTapSee
TapTapSee is an open-source mobile app that takes a different approach to the vOICe: it offloads the work of recognising specific objects in a scene to a computer and then has these reported back to the user via speech (Taptapseeapp.com, 2019). In this example, a blind user has pointed their phone at a flowerpot and wishes to know what type of flower it is.
By leveraging an external image recognition API, the app is able to recognise that it is a "dieffenbachia"/"dumb cane" and report it back via text-to-speech. While not as broadly applicable to everyday life and navigation as the vOICe, this approach has the advantage of handling complex objects and even text that a system like the vOICe would struggle with.
Conversion Algorithms
This section outlines some potential algorithms for converting images to sound. While not as practically useful or
complicated as the algorithms present in projects like the vOICe or TapTapSee, they could provide an
interesting starting point for further research and consideration. All of the algorithms described below are
available to try interactively at the top of this document.
Brightness to Frequency
Perhaps the simplest of all possible algorithms, the brightness to frequency procedure could be summarised as
follows:
Take the input image
Reduce its width and height
Find the mean value of the RGB components across all pixels
Create a sine wave with a low frequency for a low mean and a high frequency for a high mean
This creates a mapping between how bright the overall image is and how high the frequency of the produced sound is, allowing you to "hear brightness".
In this example, the input image is a photo of a Rubik's cube on a lined sheet of paper:
As you can see, the process of calculating the overall brightness removes any colour present in the image. Specifically, the overall intensity of the image is detected to be \(0.61\) out of \(1\), and so a sine wave is created 61% of the way through the interval \([200\text{Hz}, 1000\text{Hz}]\) (a range of just over two octaves), giving a tone of roughly \(690\text{Hz}\), close to F5 on a piano. In pseudocode, the algorithm might look like this:
let totalBrightness = 0;
for (let y = 0; y < image.height; y++) {
  for (let x = 0; x < image.width; x++) {
    totalBrightness += image.pixels[x][y].r;
    totalBrightness += image.pixels[x][y].g;
    totalBrightness += image.pixels[x][y].b;
  }
}
playFreq(
  map(
    totalBrightness,
    // Total possible brightness would be if all
    // pixels were 255 in R, G and B:
    0, image.width * image.height * 255 * 3,
    // Convert to between 200Hz and 1000Hz.
    200, 1000
  )
)
RGB to Frequencies
The idea of converting intensity to pitch can be extended by also accounting for the individual brightness of the red, green and blue components of an image. This is done in a similar fashion to the monochromatic case:
Take the input image
Reduce its width and height
Find the mean value of the red component of each pixel
Find the mean value of the green component of each pixel
Find the mean value of the blue component of each pixel
Find the ratio of red to green to blue across the image
Create three sine waves, with the amplitude and frequency of each corresponding to this ratio and brightness
For a practical example of this algorithm in action, here is the example of a Rubik's cube again but this time
with each colour channel treated separately:
The additional colours around the outside are an artefact of the fact that white contains equal amounts of red, green and blue. If a white object, like the sheet of paper in the above image, is ever so slightly more red than it is green or blue, it will be treated as red in the visualisation. While this appears to have a drastic effect in the above image, the outputted sound will not be significantly affected, as all colours are distorted by roughly the same amount, which only leads to a distortion in the calculated amplitude of each sine wave rather than in the frequency (which is much easier to identify).
In the final sound, there will be three groups that will show up on a spectrogram:
The black line represents the overall wave formed from the superposition of the harmonics, and the three vertical columns represent the red, green and blue channels respectively. This technique would allow a visually impaired person to "hear" colour, such as being able to discern what type of fruit something is without having to feel it. In pseudocode, it might look like this:
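// A sketch following the style of the earlier example; playFreqWithAmp is
// an assumed helper that plays a sine wave at a given frequency and amplitude.
let totalR = 0;
let totalG = 0;
let totalB = 0;
for (let y = 0; y < image.height; y++) {
  for (let x = 0; x < image.width; x++) {
    totalR += image.pixels[x][y].r;
    totalG += image.pixels[x][y].g;
    totalB += image.pixels[x][y].b;
  }
}
let total = totalR + totalG + totalB;
let maxPerChannel = image.width * image.height * 255;
// Each channel gets its own frequency band (as in the interactive example:
// red 100-500Hz, green 500-1000Hz, blue 1000-1500Hz), with loudness set by
// that channel's share of the overall brightness.
playFreqWithAmp(map(totalR, 0, maxPerChannel, 100, 500), totalR / total);
playFreqWithAmp(map(totalG, 0, maxPerChannel, 500, 1000), totalG / total);
playFreqWithAmp(map(totalB, 0, maxPerChannel, 1000, 1500), totalB / total);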
Histogram Harmonics
While using the red, green and blue channels to form harmonics allows a visually impaired person to hear colour, this conversion algorithm lets them hear orientations instead. The process looks like this:
Take input image
Reduce width and height
Apply a Sobel filter to the image
Use finite differences to calculate an estimate for edge orientation
Create a histogram of oriented gradients (HOG), binning pixels in the image according to their edge orientation
Run an Inverse Discrete Fourier Transform (IDFT) on this histogram to create a family of sine waves
Informally, it's like looking at the angles of all the edges in an image and then plotting a histogram showing which angles are the most common. This distribution is then used to create sine waves, so encoded in each is information about which way the edges are facing. In some ways, this technique is less useful than the colour harmonics conversion scheme. Here is the algorithm run on an image of a Rubik's cube:
Due to the application of the Sobel filter (which is used for detecting edges), the details of the cube itself have become almost entirely unrecognisable. Here, each white line represents the calculated orientation of an edge. However, on some images, this is more useful. Here is an image of a flat, horizontal black line:
Using the colour harmonics approach, this image would be indistinguishable from this image:
However, there is a clear difference in the two images' resulting sounds when run through this algorithm, as can be seen from these two spectrograms:
The pseudocode for this algorithm is more complex than for the other two, but could be summarised like so:
applySobelFilter(image)
let total = 0;
let bins = [...]; // one bin per range of edge angles, all starting at zero
for (let y = 0; y < image.height - 1; y++) {
  for (let x = 0; x < image.width - 1; x++) {
    let brightness = image.pixels[x][y].brightness
    // Finite differences: how much the brightness changes in x and y.
    let finiteDiffX = image.pixels[x + 1][y].brightness - brightness
    let finiteDiffY = image.pixels[x][y + 1].brightness - brightness
    let vector = Vec([finiteDiffX, finiteDiffY])
    let angle = vector.angle()
    if (angle > π/2) {
      angle -= π/2
    }
    addToBins(bins, angle)
    total += 1
  }
}
frequencies = idft(bins)
playFrequencies(frequencies)
The Sobel filter is a special image operation akin to a blur or an increase in contrast that is used before
detecting edges (Fisher et al., 2003). It looks like this:
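Concretely, a standard Sobel filter slides a pair of \(3 \times 3\) kernels over the image to estimate the brightness gradient at each pixel. These are the textbook kernels given by Fisher et al. (2003); whether the artefact uses exactly this form is an assumption:
// Standard 3x3 Sobel kernels for estimating the horizontal (Gx) and
// vertical (Gy) brightness gradients at each pixel.
const sobelX = [
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1]
];
const sobelY = [
  [-1, -2, -1],
  [ 0,  0,  0],
  [ 1,  2,  1]
];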
QR Encoding
The final conversion algorithm to highlight here is "QR encoding". Like TapTapSee, this algorithm chooses to offload the work of identifying what's in the image to the computer rather than the user. While TapTapSee works by identifying any object, the QR encoding algorithm only recognises QR codes that encode information relevant to the visually impaired. For example, textured QR codes could accompany braille and give the visually impaired easier access to information such as health and safety warnings or public transport information.
To try this out interactively, select "encoding information in qr codes" from the dropdown at the top of the document. This version of the algorithm treats QR codes as corresponding to pitches, and a tool for generating your own QR codes to try this out is available here. Here are the QR codes that represent the notes A & C:
While QR codes representing notes on a scale wouldn't be that useful, a QR code could represent anything. Here is
a QR code that will play a short recording of a documentary about the Tower of Babel when shown to the camera:
One problem with this approach is that QR codes can only contain a very small amount of data (around 24,000 bits)
(www.qrcode.com, n.d.)
and so it would not always be possible to put the actual information inside the QR code itself. Instead, the QR
codes would need to act as something like web addresses which the camera would use to look up the content from a
central server. In the Tower of Babel QR code example above, the only information contained is the word "babel"
which the software recognises as an instruction to play the clip from the documentary.
This approach would mean that much more data could be represented with a QR code. There would be around
\(2^{24,000}\)
possible addresses for use.
In very abstract pseudocode, the algorithm would look like the following:
let { qrCode, exists } = video.scanForQRCodes();
if (exists) {
  let data = lookupQRCode(qrCode); // e.g. fetch the content for this code
  playSound(data);
} else {
  // No QR code in this frame; keep scanning.
}
Conclusion
In conclusion, there exist many different ways that you can convert video information to sound. This essay has
identified a few requirements and limits that must be placed on the algorithms themselves and presented a few
new techniques that could be considered for the future.
Each method for converting video information has its own advantages and disadvantages. Perhaps a future tool
will synthesise all these paradigms
together into a
device that allows the visually impaired to use a variety of conversion techniques in order to extract as
much information from the visual world as they can.
Making this essay
It was my original intention for this essay to be around 2,000 words, but after realising that providing enough context for a discussion of the algorithms implemented in the artefact would be tricky, I changed the limit to 5,000 words. There are a lot of things I would do differently if I could start again:
Create a more detailed plan at the beginning. This would mean fewer headaches trying not to exceed the word limit and would have made the essay easier to write.
Organise my references better. After re-reading a first draft, I realised I'd failed to cite quite a few of my sources. If I'd kept better track of my references from the beginning, it would have made the process of adding the correct attributions afterwards much easier.
Pick a better format than a webpage. Making the essay interactive has meant that it's
easier
to illustrate points using examples, but has made it harder to write and share at the end of the process.
Narrow the scope of my question. If I were to start from scratch again, I would further narrow my title so as to limit the scope of what I felt I had to write about.
Copyrights and Licenses
Several pieces of free and open-source software were used in order to create this document.
References
Box, P., Van Der Maaten, L., Postma, E. and Van Den Herik, J. (2009). Dimensionality Reduction: A Comparative Review.
Fisher, R., Perkins, S., Walker, A. and Wolfart, E. (2003). Feature Detectors - Sobel Edge Detector. [online] homepages.inf.ed.ac.uk. Available at: https://homepages.inf.ed.ac.uk/rbf/HIPR2/sobel.htm.
Kiran, A., Mounika, M. and Srinivas, B. (2018). Modern Lossless Compression Techniques: Review, Comparison and Analysis.
Koch, K., McLean, J., Segev, R., Freed, M.A., Berry, M.J., Balasubramanian, V. and Sterling, P. (2006). How Much the Eye Tells the Brain. Current Biology, 16(14), pp.1428–1434.
Lenay, C., Gapenne, O., Hanneton, S., Marque, C. and Genouëlle, C. (2003). Chapter 16. Sensory substitution. Touching for Knowing, pp.275–292.
Leonard, S. (2016). Windows Image Media Types.
Markowsky, G. (2019). Information theory - Physiology | Britannica. In: Encyclopædia Britannica. [online] Available at: https://www.britannica.com/science/information-theory/Physiology.
Meijer, P. (n.d.). The vOICe - New Frontiers in Artificial Vision. [online] www.seeingwithsound.com. Available at: https://www.seeingwithsound.com/.
Meijer, P.B.L. (1992). An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2), pp.112–121.
Purves, D., Augustine, G.J., Fitzpatrick, D., Katz, L.C., Anthony-Samuel LaMantia, McNamara, J.O. and S Mark Williams (2018). The Audible Spectrum. [online] Nih.gov. Available at: https://www.ncbi.nlm.nih.gov/books/NBK10924/.
Swanson, J. (n.d.). An Interactive Introduction to Fourier Transforms. [online] www.jezzamon.com. Available at: https://www.jezzamon.com/fourier/ [Accessed 18 Feb. 2022].
Taptapseeapp.com. (2019). TapTapSee - Blind and Visually Impaired Assistive Technology - powered by CloudSight.ai Image Recognition API. [online] Available at: https://taptapseeapp.com/.
www.qrcode.com. (n.d.). Information capacity and versions of QR Code | QRcode.com | DENSO WAVE. [online] Available at: https://www.qrcode.com/en/about/version.html.