Human Interaction with Voice-AI

Investigating how social factors mediate vocal alignment to virtual assistants

The use of voice artificial intelligence (voice-AI) – in the form of virtual assistants such as Amazon’s Alexa or Apple’s Siri – has grown increasingly common in recent years. Current research indicates that social factors that mediate linguistic accommodation during human–human communication, such as gender and likability, also mediate human–voice-AI communication; however, the magnitude of these effects differs from what we see in human–human communication. We further investigate how human–voice-AI communication differs from human–human communication by testing whether variation in alignment toward speakers of different dialects is similar for model talkers presented as devices and model talkers presented as humans.

We conducted a shadowing task in which participants engaged in a question-and-answer “conversation” with text-to-speech (TTS) voices in British, American, and Indian dialects of English. To test the effect of a speaker’s top-down knowledge of the model talker’s guise, TTS voices were presented in either an authentic guise (i.e., as a device) or an inauthentic guise (i.e., as a human). As an assessment of convergence, a separate group of raters judged the perceptual similarity of participants’ pre- and post-exposure productions to the model talkers’ productions. We find that participants converge more toward Indian English voices and diverge from American English voices; additionally, participants align more toward TTS voices when they are presented as human talkers than when they are presented as device talkers. Our findings point to the need for further research on how humans interact with voice-AI and how their attitudes mediate these interactions. The outcomes of this research have implications for the design of voice-AI products and can point to areas where the technology can be improved to enhance overall human experiences with virtual assistants.

References

2023

  1. Frontiers
    Comparing alignment toward American, British, and Indian English text-to-speech (TTS) voices: influence of social attitudes and talker guise
    N. Dodd, M. Cohn, and G. Zellou
    Frontiers in Computer Science, 2023