Data Scientists’ Guide to Analysis & Modelling of Human Data from Biometric Sensors


It should be a data scientist’s dream. Sensors are now everywhere – in our homes, vehicles, workplaces and pockets – and a vast infrastructure for IoT and edge processing is emerging to soak up the deluge of data being spewed from all those sensors. At last, everything that can be measured either is or soon will be. So why is this a nightmare for many data scientists?

Note: This is part of a four-part series on Human Data-Based R&D – see end for more.

Working with sensor data is notoriously fiddly. Biometric data, sensed from and around the human body, is fiddlier still: it means operating at the fuzzier end of the analytical spectrum, often in complex and dynamic real-world environments. This introduces some of the tougher issues that come with the explosion of new data sources. But it also presents us with a transformative opportunity: to understand how people truly feel, and build systems that can adapt to them appropriately. In other words, empathic technology.

Below we take a brief look at some of the issues our data science team has encountered while measuring human data, both in the lab and in the wild, over recent years – and how we have hopefully solved them.

Building Multimodal Sensor-Based Datasets

At the heart of the challenge is a multiplicity of data streams. Across the broad range of disciplines that are now working with sensors to research human feelings, physiology or behaviour, there is an established understanding that multimodal is the way to go. The more modes of data you can measure, the more robust, consistent and accurate your analysis can be. If this argument isn’t already very familiar to you, you might want to read this post for some background: Sensor Fusion: The Only Way to Measure True Emotion.

Wrangling data from multiple sensors typically means dealing with a cacophony of different time-series streams. This can bring about a bunch of awkward nuances and edge cases that you need to overcome: varying sample rates, reifying states out of on/off-style event streams, managing multiple data sources and types, gracefully handling sensor or connection failure, to name a few. And of course these problems grow as you add new sensor streams to the system.
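To make two of those edge cases concrete, here is a minimal sketch in plain Python. It is not how Synsis does it. It assumes each stream is already a sorted list of (timestamp, value) pairs, and the function names, the sample data and the `max_gap` threshold are all made up for illustration. The first function puts an irregular stream onto a shared time grid by holding the last known value, and leaves a gap where the sensor has been silent for too long. The second reifies an on/off event stream into a state at each grid point.

```python
from bisect import bisect_right

def resample_hold(samples, grid, max_gap=None):
    """Zero-order hold: for each grid time, take the latest sample at or before it.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    Returns None where no sample exists yet, or where the latest sample is
    older than max_gap seconds (for example, because the sensor dropped out).
    """
    times = [t for t, _ in samples]
    out = []
    for g in grid:
        i = bisect_right(times, g) - 1
        if i < 0 or (max_gap is not None and g - times[i] > max_gap):
            out.append(None)
        else:
            out.append(samples[i][1])
    return out

def reify_state(events, grid, initial=False):
    """Turn sorted on/off events [(timestamp, bool), ...] into a state per grid time."""
    times = [t for t, _ in events]
    out = []
    for g in grid:
        i = bisect_right(times, g) - 1
        out.append(events[i][1] if i >= 0 else initial)
    return out

# Hypothetical data: heart rate at roughly 1 Hz with a dropout after 2.1 s,
# plus an on/off event stream (say, "driver is talking").
heart_rate = [(0.0, 62), (1.0, 63), (2.1, 65), (6.0, 70)]
talking = [(1.5, True), (3.5, False)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

hr_on_grid = resample_hold(heart_rate, grid, max_gap=2.0)
talking_on_grid = reify_state(talking, grid)
```

Holding the last value is only one choice. Interpolation or windowed averaging may suit a given signal better, and the right staleness threshold depends on the sensor.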

Alongside the raw sensor data, it is likely that you will want to view context media too, such as video and audio. Certainly, those are key channels in the kind of multimodal analysis we get our best results from, to understand what is going on as the participant experiences the emotional journey being studied. You might also wish to incorporate other context data streams such as GPS location, machine analytics, participant profile information, etc.

In the end, even if you’re turned on by the thought of unravelling these kinds of digital Gordian knots, you still need to be confident that the resulting data and media streams are synchronised, and remain so indefinitely – accounting not just for varying sample rates but also latency issues from processing heavy files like video.

Yeah, good luck with that. Over the years, we’ve collected enough blisters from trying to navigate rough data-wrangling paths not to retrace our steps if we can avoid it.
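For a flavour of the synchronisation problem, here is a small sketch of one piece of it: correcting for a stream's latency and then finding the video frame that matches each sample. It assumes the latency of each stream is known and constant, which is rarely quite true in practice. The function names and all the numbers are hypothetical.

```python
def correct_latency(samples, latency_ms):
    """Shift arrival timestamps back by a stream's (assumed constant) latency.

    samples: list of (arrival_time_ms, value) with integer millisecond timestamps.
    """
    return [(t - latency_ms, v) for t, v in samples]

def frame_index(timestamp_ms, video_start_ms, fps):
    """Index of the video frame covering a timestamp, or None before the video starts."""
    if timestamp_ms < video_start_ms:
        return None
    return (timestamp_ms - video_start_ms) * fps // 1000

# Hypothetical values: skin conductance samples that arrive 250 ms late,
# and a 30 fps video whose first frame was captured at t = 9000 ms.
arrivals = [(8750, 0.41), (10250, 0.42), (10750, 0.47)]
corrected = correct_latency(arrivals, latency_ms=250)
frames = [frame_index(t, video_start_ms=9000, fps=30) for t, _ in corrected]
```

Keeping timestamps as integer milliseconds avoids floating-point drift when the same offsets are applied many times over a long session.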

Measuring Human Data: The Process

Over many iterations of gathering and cleansing multimodal datasets, our team has been perfecting a system that automates the process for us. The idea is to allow our data scientists – along with psychologists, UX researchers and anyone else who wants to play with the data – to get straight into analysis without needing to cleanse and consolidate everything themselves. The challenge was to get all sensor and context data into one database, synchronised and tagged, and to be able to analyse it alongside synchronised media from any cameras and microphones present.

And of course, this is the bit where we say, “ta-da, we have a product for that!”

Was it that obvious? It can’t be helped, we’re proud. We’re rolling out a new version of our Synsis™ Empathic AI Kit, which we built to solve the problems we have faced in our own work, as much as to ease our customers’ pains. For a data scientist exploring human physiology, emotions and other states, the Kit is designed to get you over the hurdles of ingesting and managing your sensor data in a few minutes so you can skip straight to where you need to be: analysing an orderly dataset and training models on it.

Let’s walk through an instructive example quickly, from a real-world driving study:

[Image: Testing our in-cabin research setup before a study.]

In this example, the driver is surrounded by the following sensors:

  1. Chest-worn biometric sensor belt – recording heart rate, breathing rate, skin temperature, skin conductance (from fingers), and more.
  2. Rear-facing camera – for facial-coding analysis.
  3. Front-facing camera – for context of their current situation.
  4. Microphone – for voice analysis.
  5. Smartphone – for contextual data, primarily GPS location.

The data and media from all those sources come in a wide range of formats and parameters, with varying latency issues on top.

Our Synsis box sits in the vehicle, wirelessly collecting and syncing data from all the sensors, and recording any events that are tagged either manually by a researcher, or automatically by the system, during the session. The same device runs our models to make real-time predictions of several human states, including valence (positive-negative feeling), arousal (relaxed-excited), stress and comfort. At the end of the recording session, the raw data and features are then automatically uploaded to the cloud and synchronised to the corresponding media content and location.

[Image: Screenshot from Synsis dashboard during a road driving study.]

Now the raw and derived data are available for interrogation, alongside the media and context data, in a GUI. Alternatively, you can access the database directly to run your own analysis, or use our API to pull streams for other functions like human-machine interaction or data visualisation.
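As a rough idea of what running your own analysis against a synchronised, tagged database can look like, here is a sketch using an in-memory SQLite database. The schema is entirely hypothetical and is not the Synsis schema: one table of timestamped samples, one table of tagged events, and a query that attaches the active tag (if any) to every sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only.
conn.executescript("""
    CREATE TABLE samples (stream TEXT, t_ms INTEGER, value REAL);
    CREATE TABLE tags (label TEXT, start_ms INTEGER, end_ms INTEGER);
""")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", [
    ("heart_rate", 0, 62.0), ("heart_rate", 1000, 64.0),
    ("heart_rate", 2000, 71.0), ("heart_rate", 3000, 66.0),
    ("skin_temp", 1000, 33.1), ("skin_temp", 2000, 33.2),
])
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", [
    ("overtaking", 1000, 2500),
])

# Label each sample with the tag whose interval covers its timestamp.
# LEFT JOIN keeps untagged samples, with a NULL (None) label.
rows = conn.execute("""
    SELECT s.stream, s.t_ms, s.value, t.label
    FROM samples AS s
    LEFT JOIN tags AS t ON s.t_ms >= t.start_ms AND s.t_ms < t.end_ms
    ORDER BY s.stream, s.t_ms
""").fetchall()

tagged = [r for r in rows if r[3] is not None]
```

Because every stream shares one clock, a single range join is enough to line up sensor values with what was happening at the time.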

Training Models of Human States

And now the fun starts. With an orderly dataset in your hands, you can get on with training your own models.

At the heart of our technology are the human state models that we have developed, which use psychophysiological theories to predict a range of emotional and cognitive states based on the incoming sensor data. But you also have the option to develop your own models alongside ours, if you wish. In brief, the process goes like this:

We use the Synsis Kit both in our lab and on the road as a platform for moving through the above steps in iterative cycles, to generate new models and refine existing ones. Of course there is more detail to the process which, without getting buried in it here, typically includes:

  1. Literature review to inform the design of experimental protocols and ground truth stimulation, based on established theory.
  2. Pre- and post-session surveys for self-report data from participants.
  3. Live tagging during sessions to generate real-time annotation.

We also conduct post-session annotation using trained annotators, to reinforce the data integrity, as well as the resulting models. This step isn’t currently facilitated by the toolset in the Synsis platform but it’s on our roadmap.
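To show how annotation feeds into modelling, here is a minimal sketch of turning a tagged signal into labelled training examples. It is an illustration under simple assumptions, not our pipeline: the window length, the two features (mean and range), the label names and the data are all hypothetical, and each window simply takes the label of the first tag that covers its midpoint.

```python
from statistics import mean

def windowed_examples(samples, tags, window_s, default_label="baseline"):
    """Cut (timestamp_s, value) samples into fixed windows and label each one.

    tags: list of (label, start_s, end_s) from annotation.
    A window takes the label of the first tag covering its midpoint.
    """
    if not samples:
        return []
    start = samples[0][0]
    windows = {}
    for t, v in samples:
        windows.setdefault(int((t - start) // window_s), []).append(v)
    examples = []
    for k in sorted(windows):
        values = windows[k]
        mid = start + (k + 0.5) * window_s
        label = next((name for name, s, e in tags if s <= mid < e), default_label)
        examples.append({
            "mean": mean(values),
            "range": max(values) - min(values),
            "label": label,
        })
    return examples

# Hypothetical data: heart rate once per second, and one tagged event.
heart_rate = [(0, 60), (1, 62), (2, 64), (3, 70), (4, 74), (5, 78)]
tags = [("hard_braking", 3, 6)]
examples = windowed_examples(heart_rate, tags, window_s=3)
```

Each resulting dictionary is one row of a training table: features from the sensor stream, plus a label from the annotation.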

And there’s more besides. Measuring and modelling human moods and behaviours with automated digital tools is a complex process but hopefully we’ve outlined some of the key considerations above. Now it’s time to ask you to share any thoughts, horror stories, ideas or concerns you have in this space, as your feedback is what guides our product development priorities.

Any and all input is welcome at


Ben Bland

Chief Operations Officer