Human Interaction with Voice-AI

The use of voice artificial intelligence (voice-AI), in the form of virtual assistants such as Amazon’s Alexa or Apple’s Siri, has become increasingly common in recent years. Current research indicates that social factors that mediate linguistic accommodation during human–human communication, such as gender and likability, also mediate human–voice-AI communication; however, the magnitude of these effects differs from that observed in human–human communication. We further investigate how human–voice-AI communication differs from human–human communication by testing whether variation in alignment toward speakers of different dialects is similar for model talkers presented as devices and model talkers presented as humans.

We conducted a shadowing task in which participants engaged in a question-and-answer “conversation” with text-to-speech (TTS) voices in British, American, and Indian dialects of English. To test the effect of a speaker’s top-down knowledge of the model talker, TTS voices were presented in either an authentic guise (i.e., as a device) or an inauthentic guise (i.e., as a human). As an assessment of convergence, a separate group of raters judged the perceptual similarity of participants’ pre- and post-exposure productions to the model talkers’ productions. We find that participants converge more toward Indian English voices and diverge from American English voices; additionally, participants align more toward TTS voices when they are presented as human talkers than when they are presented as device talkers. These findings show that further research is needed to better understand how humans interact with voice-AI and how their attitudes mediate these interactions. The outcomes of this research have implications for the design of voice-AI products and can point to areas where the technology can be improved to enhance users’ overall experience with virtual assistants.

Text-to-speech (TTS) voices, which vary in their apparent native language and dialect, are increasingly widespread. In this paper, we test how speakers perceive and align toward TTS voices that represent American, British, and Indian dialects of English and the extent to which social attitudes shape patterns of convergence and divergence. We also test whether top-down knowledge of the talker, manipulated as a “human” or “device” guise, mediates these attitudes and accommodation. Forty-six American English-speaking participants completed identical interactions with six talkers (two from each dialect) and rated each talker on a variety of social factors. Accommodation was assessed with AXB perceptual similarity judgments by a separate group of raters. Results show that speakers had the strongest positive social attitudes toward the Indian English voices and converged toward them more. Conversely, speakers rated the American English voices as less human-like and diverged from them. Finally, speakers overall showed more accommodation toward TTS voices that were presented in a “human” guise. We discuss these results through the lens of Communication Accommodation Theory (CAT).
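To make the AXB measure concrete, the following is a minimal sketch of how convergence could be scored from rater responses. It assumes a typical AXB design, in which a rater hears a participant's pre-exposure and post-exposure productions alongside the model talker's production (X) and chooses which of the two sounds more like X. The function names, the condition labels, and the 0.5 chance baseline are illustrative assumptions; the abstract does not describe the statistical analysis actually used, and the trial data shown are invented solely to exercise the code.

```python
from collections import defaultdict


def convergence_by_condition(trials):
    """Return the proportion of AXB trials on which raters chose the
    post-exposure production as more similar to the model talker.

    `trials` is an iterable of (condition, chose_post) pairs, where
    `chose_post` is True when the rater picked the post-exposure token.
    (Hypothetical data layout, not taken from the study.)
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [post choices, total]
    for condition, chose_post in trials:
        counts[condition][0] += int(chose_post)
        counts[condition][1] += 1
    return {cond: post / total for cond, (post, total) in counts.items()}


def label(proportion, chance=0.5):
    """Interpret a proportion relative to the assumed chance baseline."""
    if proportion > chance:
        return "convergence"
    if proportion < chance:
        return "divergence"
    return "no change"


# Invented responses for two placeholder conditions (not study results).
example_trials = [
    ("condition_a", True), ("condition_a", True),
    ("condition_a", True), ("condition_a", False),
    ("condition_b", True), ("condition_b", False),
    ("condition_b", False), ("condition_b", False),
]

scores = convergence_by_condition(example_trials)
for cond, prop in scores.items():
    print(cond, prop, label(prop))
```

Under this reading, a proportion above 0.5 means raters tended to hear the post-exposure production as closer to the model talker (convergence), and a proportion below 0.5 indicates divergence. A full analysis would normally model the trial-level responses statistically rather than compare raw proportions.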

2023