The controlled experiment approach to evaluating user interfaces

Research methods for Human-Computer Interaction

This post is part of a series of notes I collated during my studies at UCL’s Interaction Centre (UCLIC).

The question most commonly asked: “Does making a change to the value of variable x have a significant effect on the value of variable y?”

  • x could be an interface or interaction feature.
  • y could be the time to complete task, number of errors, work load, the user’s subjective satisfaction, etc.

Controlled experiments are more widely used in HCI research than in practice, where the costs of designing and running experiments often outweigh their benefits.

Consider what the appropriate user population is:

  • A representative sample of the user population is recruited as participants (not always feasible).
  • If a non-representative sample of users is involved, consequences must be considered.
  • How many participants to recruit depends on the power of the statistical tests, the time available for the study, the ease of recruiting participants, the incentives available for participant rewards, etc.
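
As a rough illustration of how statistical power feeds into the recruitment decision, the sketch below estimates the number of participants per condition for a between-subjects comparison of two means, using the standard normal approximation n ≈ 2((z_(1-α/2) + z_power) / d)². The function name, the effect size of 0.5 and the 80% power target are assumptions chosen for illustration, not values from these notes. A dedicated power-analysis tool based on the t distribution would give slightly larger numbers.

```python
import math
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per condition for a two-sided, two-group comparison.

    effect_size is Cohen's d (difference in means / pooled standard deviation).
    Uses the normal approximation, so it slightly underestimates the n that a
    t-test based calculation would give.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical "medium" effect of d = 0.5
    print(participants_per_group(0.5))
```

Smaller expected effects push the required sample up quickly, which is why the time available and the ease of recruitment have to be weighed alongside power.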

Blandford et al. (2008) – VIP:

  • Vulnerable participants.
  • Informed consent.
  • Privacy, confidentiality and trust.

Ensure all participants are informed of the purpose of the study and what will be done with the data.

Anonymise data where possible.

Offer the participants the opportunity to talk about the experiment in a debriefing session after they have finished the tasks.

Data stored in accordance with legislation:

  • UK Data Protection Act.
  • Need to register with the government if identifiable data is stored.

A controlled experiment tests a hypothesis – the effects of a design change on a measurable performance indicator.

Hypothesis example:

“A particular combination of speech and key-press input will greatly enhance the speed and accuracy of people sending text messages on their mobile phone.”

The aim of the experiment is, formally, to fail to prove the null hypothesis (that is, to reject it).

For the above hypothesis: you design an experiment that, in all fairness, ought not to make any difference to the speed and accuracy of the texting.

The assumption that there is no difference between designs is the null hypothesis.

The study is designed to show that the intervention has no effect, within the bounds of probability.

The failure to prove the null hypothesis provides evidence that there is a causal relationship between the independent and the dependent variables.
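
To make the logic concrete, here is a minimal sketch of one way to test a null hypothesis for a between-subjects comparison: an exact permutation test on the difference in mean completion time. This is an illustrative technique chosen for its transparency, not the analysis prescribed by these notes. The function name and the timing data are invented. If the null hypothesis holds, the condition labels are interchangeable, so the test counts how many relabellings of the data give a difference at least as large as the one observed.

```python
from itertools import combinations
from statistics import fmean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(fmean(group_a) - fmean(group_b))

    extreme = 0
    total = 0
    indices = range(len(pooled))
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding
        if abs(fmean(a) - fmean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical task completion times in seconds (invented for illustration)
key_press_only = [41.2, 38.5, 44.0, 39.8, 42.1]
speech_and_key_press = [33.0, 35.4, 31.8, 36.2, 34.1]

p = permutation_p_value(key_press_only, speech_and_key_press)
print(f"p = {p:.4f}")
```

With these invented data only 2 of the 252 possible relabellings are as extreme as the observed split, so the null hypothesis would be rejected at the conventional 0.05 level. Enumerating every split is only practical for small samples. Larger studies would typically sample relabellings at random or use a parametric test.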


  • Independent variable(s) – the variable that is intentionally varied (one or many).
  • Dependent variable(s) – the variable that is measured (time, error rate, workload).
  • Confounding variable(s) – variables that are unintentionally varied between conditions of the experiment and which affect the measured values of the dependent variable.
    • Example: when testing two interfaces for text entry, using a different message on each device would affect entry time.
    • Example: Complexity of certain words, using dictionary mode vs. not, using natural language vs. “text speak”.

A simple answer to the texting example is to make sure each message is input on each device.

The aim is to vary the independent variable in a known manner, to measure the dependent variable(s) and to minimise the effects of the confounding variables on the outcome of the study.

  • Avoid different rooms, different computers.
  • Randomise variables such as the time of day.
  • Be aware of differences between individuals.
    • Men vs. women
    • Personality
    • Aesthetic sensibilities
    • Cognitive skills
    • Age
    • Education level

4. Design: “Within Subjects” or “Between Subjects”


  • Within subjects – each participant performs under all sets of conditions.
  • Between subjects – each participant only performs under one condition.
  • Mixed factorial – one independent variable is within subjects and another is between subjects.
  • Are participants required to compare interfaces (therefore a within subjects design)?
  • Are there likely to be unwelcome learning or interface effects across conditions (therefore a between subjects design)?
  • What statistical tests are planned?
  • Time taken to complete the task? (The longer the time, the less likely a within subjects design can be used).
  • How easy is it to recruit participants? (The more people, the more feasible a between subjects design is).
  • Within subjects advantage: individual differences are less likely to influence results.
  • Within subjects disadvantage: possible learning effects, complex statistics.
  • Participants typically required to repeat similar procedures multiple times with different values of the independent variable.
  • Advisable to generate multiple tasks, one for each condition, for each participant to perform.
    • The task becomes an independent variable, but one with no direct interest in analysis.
    • Different tasks/values sometimes referred to as levels.
    • Example, 2 independent variables: mode of navigation (speech, key-press), mode of message (speech, predictive text, multi-tap).
| Nav \ Message | Speech | Predictive text | Multi-tap |
| --- | --- | --- | --- |
| Speech | Condition 1 | 2 | 3 |
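
The two-factor example above can be enumerated mechanically: every combination of one level from each independent variable is one condition. The sketch below assumes the conditions are numbered row by row, which matches conditions 1 to 3 in the table. The numbering of the key-press conditions as 4 to 6 is an assumption, and the variable names are illustrative.

```python
from itertools import product

navigation_modes = ["speech", "key-press"]
message_modes = ["speech", "predictive text", "multi-tap"]

# Every combination of one level from each independent variable is a condition.
conditions = {
    number: {"navigation": nav, "message": msg}
    for number, (nav, msg) in enumerate(
        product(navigation_modes, message_modes), start=1
    )
}

for number, levels in conditions.items():
    print(number, levels["navigation"], "+", levels["message"])
```

A 2 × 3 design therefore yields six conditions, which is worth knowing early because each one needs its own task and its own share of participant time.
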
  • Software simulators
  • [Smartphone] emulators

Often require some programming skills to create prototypes.

The procedure describes what the participants are going to do during the experiment.

  • Ensures that every participant in the experiment has the same experience, which helps eliminate confounds.
    Example, study of young vs. old – you don’t want to treat the latter more deferentially.
  • Allows others to replicate the experiment – the basis of “good science” – which helps eliminate confounds between entirely different attempts.
  • Control the order of the experiments – performance effects (learning), novelty effects.
  • Control the tasks that the participants are to perform – devise different tasks to avoid learning effects and reduce boredom.
  • Control the context in which the study is run – using a different computer or room, different time of day. Innocuous changes can influence results.

A systematic approach to variation is appropriate.

  • “Latin square” design – a grid in which every element appears precisely once in each row and each column.
    • The row represents the order in which elements are administered.
    • The column represents the sequence of participants.
| Group \ Task | First | Second | Third | Fourth |
| --- | --- | --- | --- | --- |
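
A Latin square for any number of tasks can be generated by rotating the task list one place per row. The sketch below is a minimal cyclic construction with a check of the "precisely once in each row and each column" property. The task labels A to D and the function names are placeholders, not taken from the original notes.

```python
def latin_square(items):
    """Cyclic Latin square: row r is the item list rotated left by r places."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Check every item appears exactly once in each row and each column."""
    expected = set(square[0])
    rows_ok = all(
        len(row) == len(expected) and set(row) == expected for row in square
    )
    cols_ok = all(set(col) == expected for col in zip(*square))
    return rows_ok and cols_ok and len(square) == len(expected)


tasks = ["A", "B", "C", "D"]  # hypothetical task labels
square = latin_square(tasks)
for group, order in enumerate(square, start=1):
    print(f"Group {group}: {' '.join(order)}")
```

Each row gives the task order for one group of participants. One caveat of the cyclic construction is that a given task always follows the same predecessor, so it balances position but not carry-over between tasks.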

Mixed factorial example – the order of presentation of tasks and interfaces is systematically varied to eliminate order effects.

| Group \ Task | First | Second |
| --- | --- | --- |
| One | Interface 1, Task A | Interface 2, Task B |
| Two | Interface 1, Task B | Interface 2, Task A |
| Three | Interface 2, Task A | Interface 1, Task B |
| Four | Interface 2, Task B | Interface 1, Task A |
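
The four groups in the table above follow a simple rule: pick which interface and which task come first, then use the remaining interface with the remaining task second. A minimal sketch of that rule follows. The helper names and the round-robin assignment of participants to groups are assumptions for illustration (random assignment with equal group sizes is a common alternative).

```python
from itertools import product

interfaces = ["Interface 1", "Interface 2"]
tasks = ["Task A", "Task B"]


def other(options, chosen):
    """Return the one option in a two-item list that was not chosen."""
    return options[1] if chosen == options[0] else options[0]


# One group per (first interface, first task) pair; the second session
# uses the remaining interface with the remaining task.
groups = [
    [(first_if, first_task), (other(interfaces, first_if), other(tasks, first_task))]
    for first_if, first_task in product(interfaces, tasks)
]


def group_for(participant_index):
    """Round-robin assignment (an assumption; random assignment is also common)."""
    return participant_index % len(groups)


for number, sessions in enumerate(groups, start=1):
    print(number, sessions)
```

Recruiting participants in multiples of four keeps the groups the same size, so each interface and each task appears first equally often.
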
  • Clear and consistent task instructions.
    • Level of detail required.
    • Reasonable time limit.
    • Engaging.
  • Piloting the experiment to ensure people behave as anticipated.
    • Ensure correct data is being gathered.
  • Ensure all equipment is working properly.
  • Probes the phenomenon more deeply.
  • Series of limited experiments each of which involves a simple controlled manipulation.
  • More reliable to conduct a series of investigations.

Updated on: 13 February 2021
