Sensor Fusion: The Only Way to Measure True Emotion
Artificial intelligence is learning to deal with our emotions. One of the major current developments in smart software and machines is that we are training them to understand our emotional states, in order to provide more appropriate, more effective and more human products and services.
For the last seven years we have been measuring people’s emotions ‘in the wild’, across a huge range of realistic scenarios. In that time it has become increasingly obvious that you can’t depend on any one data stream to be able to infer emotions accurately or reliably. A more holistic approach is needed.
Here’s why and how…
Note: this is a Sensum Lab Notes post, curated by our Lead Cognitive Psychologist, Nicole Andelic — written from the coal-face of studying human emotions and building empathic technology.
There have been great improvements recently in the world of emotion research, largely resulting from the shrinking costs and growing accuracy of biometric sensors. But no single sensor type gives a complete picture of a person’s emotions. Some data streams tell us how intense the emotion is, others indicate the activity we are doing. Some streams provide emotion analysis from our tone of voice or our facial expressions. Although many emotion tech providers are experts in one particular sensor or data type, relying on a single type will at best give you reductive information; at worst it will be misleading.
To illustrate with an example that fits current technology trends, imagine a car driver who has to brake suddenly.
Our driver is calmly cruising down a main road when another car pulls out in front from an adjoining road. The driver feels a sense of alarm as they kick the brake pedal and grimace in fear, anxiously hoping the vehicle will stop short of a collision. The car stops in time, just. The driver sighs with relief. Moments later, they catch the eye of the person at the wheel of the other car. Our driver scowls angrily and shouts obscenities, before screeching the tires and racing off up the road.
Now, from the perspective of the ‘blind’ data collected from the vehicle and driver, we can watch the same incident unfolding from behind the scenes.
Let’s say the driver is wearing a heart rate sensor, has a camera and microphone in the cockpit, and has permitted those devices to send their data to an emotion-processing engine. The driver’s heart rate would probably show a sudden spike, then gradual fall-off. This might indicate an emotional response to an incident such as nearly colliding with another vehicle, but it might just as easily be that the driver has just drunk some coffee or listened to a favourite song on the radio. Here, one data stream is not enough.
At the same time, the onboard camera shows the driver’s face screw up in terror, and the microphone records their enraged shouts. These emotional signals can be measured by facial-coding and voice-analysis software. Here, we may see outputs from the data such as ‘scared’ then ‘angry’, adding more detail to our understanding of the driver’s response. But again, scared and angry at what? If we combine all of these physiological data streams with, say, speed, engine and braking data from the car, or a context video from another, front-facing camera, we could establish a meaningful picture of the scene.
With this compound image of synchronised data sources we can uncover rich insights to feed the design of more empathic, and thus more natural and intelligent, products and services. We can teach our machines how to respond appropriately in the moment.
To do so, we need sensor fusion.
By recording and analysing multiple data streams we can paint a more nuanced picture of the user’s scenarios and emotions. We can interpret various streams of information simultaneously, emphasising those that are providing the most robust and relevant data at that moment. This is what the human brain does all the time with minimal conscious effort. Our senses provide a range of information that the brain prioritises differently from one situation to the next, to make our understanding of the world more relevant, useful and efficient.
This kind of multimodal analysis not only provides us with clearer insight into emotions, behaviour and scenarios, it also offers the insurance of maintaining a constant signal when one stream drops out or loses clarity.
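As a rough illustration of weighting streams by how robust they are at a given moment, and of carrying on when one drops out, here is a minimal sketch. It is not our production pipeline: the function name `fuse_arousal`, the stream names, and the idea that each stream reports an arousal estimate between 0 and 1 alongside a confidence score are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical input: stream name -> (arousal estimate in 0..1 or None, confidence >= 0).
Reading = Tuple[Optional[float], float]


def fuse_arousal(readings: Dict[str, Reading]) -> Optional[float]:
    """Confidence-weighted average of per-stream arousal estimates.

    Streams that have dropped out (estimate is None) or that carry no
    confidence are skipped, so the remaining streams keep the signal alive.
    Returns None only when no usable stream is left.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for estimate, confidence in readings.values():
        if estimate is None or confidence <= 0:
            continue  # stream unavailable or not trustworthy right now
        weighted_sum += estimate * confidence
        total_weight += confidence
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


if __name__ == "__main__":
    # All streams present: the more confident stream pulls the result towards it.
    print(fuse_arousal({"heart_rate": (0.8, 0.9), "face": (0.4, 0.3)}))
    # The camera loses the face: the fused value falls back to heart rate alone.
    print(fuse_arousal({"heart_rate": (0.8, 0.9), "face": (None, 0.0)}))
```

A weighted average is only one of many possible fusion strategies. It is used here because it makes the two ideas above, prioritising the most reliable streams and tolerating dropouts, easy to see in a few lines.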
Further, by collating simultaneous feeds from different sources we can validate existing measurements against each other, to check their accuracy and relevance. Sensor fusion might also offer cost efficiencies by allowing the use of low-cost, less accurate sensors, which are compensated by input from the other data streams.
Finally, sensor fusion can guard us against external factors interfering with our metrics. For example, if a sensor claims that the user’s skin temperature is rising, we might be able to infer an emotional response from that change. But if we also see from the pattern of their heart rate that the individual remains calm, it may just be that the external temperature is rising and warming their skin.
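The skin temperature example can be sketched as a simple cross-check between two streams. This is an illustrative rule only: the function name, the labels it returns and both thresholds are hypothetical values chosen for the example, not calibrated figures from our fieldwork.

```python
def classify_skin_temp_rise(
    skin_temp_delta_c: float,
    heart_rate_delta_bpm: float,
    temp_threshold_c: float = 0.5,    # assumed threshold, for illustration only
    hr_threshold_bpm: float = 10.0,   # assumed threshold, for illustration only
) -> str:
    """Cross-check a rise in skin temperature against the change in heart rate."""
    if skin_temp_delta_c < temp_threshold_c:
        return "no_change"  # skin temperature has not risen meaningfully
    if heart_rate_delta_bpm >= hr_threshold_bpm:
        # Both streams moved together, so an emotional response is plausible.
        return "possible_emotional_response"
    # Skin is warmer but heart rate stayed calm: suspect the environment.
    return "likely_ambient_warming"


if __name__ == "__main__":
    print(classify_skin_temp_rise(1.0, 2.0))   # warmer skin, calm heart rate
    print(classify_skin_temp_rise(1.0, 15.0))  # warmer skin, raised heart rate
```

In practice a third stream, such as an ambient temperature reading, would make the second inference far more trustworthy than a heart rate check alone.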
Although we are convinced that multimodal data capture is the future of emotion research, as well as the resulting empathic technology that comes from it, there are still many challenges that need to be solved. Here are some we deal with frequently:
- Wireless sensor signals interfere with each other.
- Multiple sensors, and their data, need to be synchronised.
- Large amounts of data need to be stored, uploaded and analysed.
- Picking out the features that are relevant, especially in ‘noisy’ environments.
- Feeding back the information to the user without distracting them.
We have built solutions for all of the problems above but the enhancement and optimisation of those solutions is an ongoing process. We are always finding new issues as we continue to gather data from humans ‘in the wild’, tweak our products and algorithms based on the findings, then go round again.
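To give a flavour of the synchronisation challenge in the list above, here is a minimal sketch of aligning streams that were sampled at different rates onto a shared timeline by nearest timestamp. It is one simple approach among many and not a description of our own solution; the function names, the stream names and the `max_gap` tolerance are assumptions for the example, and it presumes every stream's timestamps are sorted and already share one clock.

```python
from bisect import bisect_left
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Hypothetical stream format: (sorted timestamps in seconds, matching values).
Stream = Tuple[Sequence[float], Sequence[Any]]


def nearest_sample(timestamps: Sequence[float], values: Sequence[Any],
                   t: float, max_gap: float) -> Optional[Any]:
    """Return the value sampled closest to time t, or None if none is within max_gap."""
    if not timestamps:
        return None
    i = bisect_left(timestamps, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    if abs(timestamps[best] - t) > max_gap:
        return None  # treat the stream as missing at this moment
    return values[best]


def align(reference_times: Sequence[float], streams: Dict[str, Stream],
          max_gap: float = 0.5) -> List[Dict[str, Any]]:
    """Build one row per reference time, holding the nearest sample from each stream."""
    rows = []
    for t in reference_times:
        row: Dict[str, Any] = {"t": t}
        for name, (ts, vals) in streams.items():
            row[name] = nearest_sample(ts, vals, t, max_gap)
        rows.append(row)
    return rows


if __name__ == "__main__":
    streams = {
        "heart_rate": ([0.0, 1.0, 2.0], [70, 95, 88]),
        "brake": ([0.8, 1.1], [0.0, 1.0]),
    }
    for row in align([1.0, 2.0], streams):
        print(row)
```

Rows where a stream comes back as `None` mark the moments when that sensor had nothing recent enough to contribute, which is exactly where the other streams have to carry the interpretation.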
What is unlikely to change is the need to look for the emotion in multiple simultaneous data streams. We can get a head start on this by learning from nature, by recognising that we humans are multimodal.
Imagine you are the driver who pulled out from the side road and caused the near-collision described above. How would you interpret and respond to the emotions of the situation? By the look on the other driver’s face, the stance of their body, the venom in their voice as they curse you, how do you interpret their feelings and predict their next actions? Consider how your emotions would drive you to respond.
Along with your felt emotions, you would contextualise your sensory input with a wide range of other inferences from the surrounding environment, based on your prior knowledge of how the world, and we emotional beings, work. Did you look before you moved? Are there other immediate dangers around you? Do you have space to manoeuvre? The context can radically alter your emotional response.
This is what we must teach our machines to do: interpret the surrounding world in real time, and respond to our behaviour and feelings appropriately, just like a good friend would.
This is the first of what we hope will be an enlightening series of insights from our fieldwork, called Sensum Lab Notes. We’re keen for any suggestions, criticisms or other feedback. Bring it on!