Autism spectrum disorder (ASD) is a neurodevelopmental disorder affecting approximately 1 out of 100 children worldwide1. Its core symptoms are persistent deficits in communication and social interaction, and restricted and repetitive patterns of behavior, interests, or activities2. Besides the standard methodology for ASD diagnosis (semi-structured interviews and questionnaires), there is increasing interest in techniques capable of providing objective assessments based on biometrics. In recent decades, due to the heterogeneous nature of the disorder, interest in using implicit measures to detect objective biomarkers of ASD has been growing3. The advantage of these techniques lies in the possibility of capturing both the explicit and the implicit manifestations of the disorder. Moreover, these objective measures help to design tailored interventions that address individual needs. Some of these techniques include electroencephalography, eye tracking, video analysis, video modeling, voice analysis, and event-related potentials3, which serve to “sensorize” the technological system. In other words, the technological system is capable of recording and analyzing users’ psychophysiological responses (in real time or offline) to subsequently guide the training or intervention. In this way, it can be defined as an intervention “system equipped with sensorization”.
ASD research is increasingly moving toward the implementation of new technologies over traditional methodologies. Several studies have adopted standard computer-based or mobile applications in the ASD field because of their greater objectivity in stimulus administration4. Elsewhere, robotic technologies have been chosen to strengthen the presence of an agent or character capable of communicating and moving in space4. A meta-analysis on the topic provided evidence that social skills are the ability most often trained in ASD using new technologies5. Specifically, initiating a conversation, social conventions, responding to others, nonverbal behaviors, regulating emotions and reciprocity, and relationships are key research targets. In addition to mobile applications and robots, virtual reality (VR) has been suggested as an effective tool for the assessment and training of individuals with ASD6. The strength of VR lies in its safety in providing controlled and realistic content, which enables learning effects to transfer to real situations6. VR could be an economical alternative to robots, with the added advantage of conveying richer and more ecological human-like interactions through the inclusion of virtual agents.
A virtual agent is a computer program that can take any form and appearance depending on how it is coded. Virtual humans exhibit human-like behaviors, such as speech, gestures, and movements; at a finer level, they can also exhibit emotions, planning, motivation, and memory7. By controlling the verbal and nonverbal communicative behaviors of virtual humans, and ensuring a good level of realism, it is possible to induce socially relevant cognitive processes in the user with whom they interact8. This sort of virtual agent with a human-like appearance enables realistic human interactions in the virtual environment. It is therefore a very useful means of analyzing social dynamics, which are associated with one of the most common ASD deficits8. Several types of deployment of these agents are possible. In particular, virtual humans can be included in virtual environments in a passive way, by guiding the virtual experience (through standardized instructions, questions, or feedback), or in an active way, through responsiveness. Virtual humans endowed with responsiveness can modify their verbal or nonverbal output according to the information gathered from the user. On a technical level, the user’s expressions (e.g., semantic utterances, prosodic elements, body movements) are processed and analyzed by computer programs as inputs, and the response (verbal or not) is then formulated and delivered by the virtual human. To ensure the accuracy and quality of the responsive feedback, the computer model must be trained on multiple inputs and possible responses. Owing to this feature, virtual humans can be considered adaptive to the different situations generated by users. Responsive virtual humans can therefore provide new directions for ASD therapy, given their potential to make treatments highly individualized4.
Indeed, virtual humans can communicate using nonverbal cues, such as gestures, pointing, or eye gaze, and linguistic features, such as prosodic cues, written text, and speech7. The latter make it possible to create a context similar to natural human language, owing to the responsiveness these agents possess.
Turning to virtual humans capable of communicating through language-based modalities, two control modules have mostly been used to date. Language-based control modules allow the virtual agent to generate conversations according to how it is programmed. The response-retrieval dialogue method relies on a predefined and recorded set of verbal or written information that the system retrieves in response to the user’s input9. The response-generation method, on the other hand, bases the response on the probability distribution learned from the training data, consistent with the user’s input9.
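The contrast between the two dialogue methods can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the implementation used in any of the reviewed studies: the canned responses, the keyword-overlap matching rule, and the bigram model standing in for a learned probability distribution are all hypothetical choices made for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical predefined responses (response-retrieval method).
# Keys are space-separated keywords; values are the recorded replies.
CANNED = {
    "hello hi": "Hello! Nice to meet you.",
    "name who": "My name is Sam.",
    "bye goodbye": "Goodbye, see you soon.",
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]


def retrieve(user_input, fallback="Can you say that again?"):
    """Response retrieval: return the stored reply whose keywords
    overlap most with the user's input, or a fallback if none match."""
    words = set(tokenize(user_input))
    best_key = max(CANNED, key=lambda k: len(words & set(k.split())))
    if not words & set(best_key.split()):
        return fallback
    return CANNED[best_key]


def train_bigrams(corpus):
    """Response generation, step 1: count word-to-next-word transitions
    in the training sentences (a toy stand-in for a learned distribution)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start="<s>", max_words=10):
    """Response generation, step 2: greedy decoding, always taking the
    most frequent next word until the end marker or the word limit."""
    word, output = start, []
    for _ in range(max_words):
        if not counts[word]:
            break
        word = counts[word].most_common(1)[0][0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


# Hypothetical training sentences for the generation model.
model = train_bigrams(["I like trains", "I like trains", "I like games"])
print(retrieve("Hi there"))
print(generate(model))
```

The retrieval function can only return one of the stored replies, whereas the generation function assembles a reply word by word from statistics of the training data, which is the property that lets response-generation systems produce sentences that were never recorded in advance.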
This article aimed to provide an overview of the studies that have adopted a virtual agent with a human-like appearance in the treatment and evaluation of ASD. The goal was to investigate the characteristics and capabilities of these virtual humans (particularly whether they are able to converse), how they operate on the user, and for which purpose (i.e., assessment, intervention, or training).
Materials and methods
The literature search was carried out in the Scopus electronic database using the following keywords: TITLE-ABS-KEY(“virtual human” OR “conversational agent” OR “virtual character” OR “avatar”) AND (“autism” OR “autistic spectrum disorder” OR “ASD”) AND (“diagnosis” OR “assessment” OR “training” OR “intervention” OR “clinical*”). Inclusion criteria were: a) research articles and conference articles published during the last 10 years, b) ASD participants between 6 and 19 years old (or with a mean age below 19 years), c) a virtual agent with human features (e.g., eyes, mouth, etc.), d) a technology-based intervention, training, or comprehensive assessment on ASD, and e) a virtual agent whose core feature focused on training or evaluating the user. Exclusion criteria were: a) systematic reviews, book chapters, or meta-analyses, b) articles published in non-English languages, c) articles not reporting the age and number of participants, and d) articles not fully available.
Findings were categorized by: virtual human interface, virtual human appearance, system sensorization, virtual human responsivity, interaction module, trained ability, and participants’ age. The first category was split into two possible interfaces of the virtual human: 2D or 3D. Appearance referred to whether the virtual human was depicted only at the face and torso level, as a full body, or in both modes. The sensorization category covered all implicit (i.e., psychophysiological) and explicit (i.e., behavioral response, semantics) measures of the user’s inputs that the system recorded during the experience. Virtual human responsivity referred to the degree to which the virtual human was able to respond to users and adapt to their inputs; depending on the implicit and explicit data collected by the system, the virtual human may or may not be responsive. The interaction module category reported the extent to which the virtual human was able to interact naturally with the user. Two options were distinguished: predetermined sentences (i.e., automatic and standardized sentences delivered at certain moments of the experience without considering the user’s input), when the virtual human did not interact naturally, and the response-generation method. Finally, the core ability investigated by the application tested in each study and the participants’ age were reported.
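As an illustration only, the coding scheme described above could be represented as a small Python record type. The class and field names below are hypothetical and not taken from the reviewed studies; the third interaction option (no conversational content) is an assumption drawn from the three groups distinguished later in the Discussion, and the example record is invented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Interface(Enum):
    TWO_D = "2D"
    THREE_D = "3D"


class Appearance(Enum):
    FACE_AND_TORSO = "face and torso"
    FULL_BODY = "full body"
    BOTH = "both"


class InteractionModule(Enum):
    NONE = "no conversational content"  # assumption, see lead-in
    PREDETERMINED = "predetermined sentences"
    RESPONSE_GENERATION = "response generation"


@dataclass
class StudyRecord:
    """One reviewed study, coded along the categories of the review."""
    label: str
    interface: Interface
    appearance: Appearance
    # Measures recorded by the system, e.g. "eye movements", "prosody".
    sensorization: List[str] = field(default_factory=list)
    # Channels through which the virtual human adapts, e.g. "eye gaze".
    responsive_channels: List[str] = field(default_factory=list)
    interaction: InteractionModule = InteractionModule.NONE
    trained_ability: str = ""
    mean_age: Optional[float] = None

    @property
    def is_responsive(self) -> bool:
        """A virtual human counts as responsive if it adapts on any channel."""
        return bool(self.responsive_channels)


# Invented example record, for illustration only.
example = StudyRecord(
    label="hypothetical study A",
    interface=Interface.THREE_D,
    appearance=Appearance.FULL_BODY,
    sensorization=["eye movements", "head orientation"],
    responsive_channels=["eye gaze"],
    interaction=InteractionModule.PREDETERMINED,
    trained_ability="social skills",
    mean_age=12.5,
)
print(example.is_responsive)
```

Keeping sensorization and responsive channels as separate lists reflects the distinction made in the text: a system may record user data without the virtual human adapting to it.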
Results
Eleven out of eighty-eight papers met the inclusion criteria. The eleven selected articles are presented in Figure 1.
The studies involved 184 participants in total (of whom 33, or 18%, were children or teenagers with typical development). Eight articles implemented a 2D virtual human; the remaining three used a 3D representation. Six studies presented only the upper part of the virtual human (face and trunk), two presented the full body, and three implemented both modalities. Regarding system sensorization, the studies recorded several kinds of input: behavioral response (5), facial expressions (5), head orientation (3), semantics (3), prosody (3), eye movements (2), and motor movements (1). Among the eleven articles, only six implemented adaptive virtual humans characterized by responsivity. The form of virtual human responsivity varied among the selected articles and involved facial expression (3), head orientation (2), eye gaze (2), audiovisual prompts (1), finger pointing (1), and head nods (1). Among the virtual humans capable of producing speech content, four used a type of speech interaction that did not follow natural conversational criteria, as it relied on predetermined sentences. Three articles instead used the response-generation dialogue method and were thus capable of holding a conversation. Nine articles aimed to train social skills, one aimed to train hand-eye coordination, and one aimed to train for job interviews with a view to social integration. Finally, the participants’ mean age was under 13 years in four cases and over 13 years in seven cases.
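The proportions and category totals reported above can be checked with a short Python sketch. The figures come from the text, with two assumptions: the 3D count of three is derived from "the remaining" articles, and the appearance counts are read as six upper-body, two full-body, and three mixed studies. Variable and function names are illustrative.

```python
SCREENED = 88            # papers retrieved by the search
INCLUDED = 11            # papers meeting the inclusion criteria
PARTICIPANTS = 184       # participants across the included studies
TYPICAL_DEVELOPMENT = 33 # participants with typical development

# Mutually exclusive categories: each should add up to the 11 included studies.
breakdowns = {
    "interface": {"2D": 8, "3D": 3},  # 3D count derived as 11 - 8
    "appearance": {"face and trunk": 6, "full body": 2, "both": 3},
    "trained ability": {
        "social skills": 9,
        "hand-eye coordination": 1,
        "job interviews": 1,
    },
    "mean age": {"under 13": 4, "over 13": 7},
}


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def consistent(counts, total=INCLUDED):
    """True if the category counts add up to the number of included studies."""
    return sum(counts.values()) == total


print(share(TYPICAL_DEVELOPMENT, PARTICIPANTS))  # share of typically developing participants
print(share(INCLUDED, SCREENED))                 # share of screened papers included
for name, counts in breakdowns.items():
    print(name, consistent(counts))
```

The sensorization and responsivity counts are left out of the consistency check on purpose: a single study can record several kinds of input, so those tallies need not sum to eleven.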
Discussion
The objective of this article was to review recent studies introducing virtual agents with human-like features (i.e., virtual humans) for ASD treatment or diagnosis. Specifically, the most important properties in terms of system sensorization and agent features were summarized. The focus on virtual technologies stemmed from the notion that they offer effective methods for treating ASD deficits; in particular, the implementation of virtual humans in VR can be useful for both verbal and nonverbal individuals due to the enhanced feeling of realistic presence in social interactions8.
Regarding the interface and appearance of the virtual human in the selected articles, the analysis raised an important issue in this research trend: confusion over the definition of specific technical terms. For instance, this limitation occurred in Genova et al. (2021), in which the definition of “virtual human” is placed in the context of Virtual Reality Job Interview Training (VR-JIT). Other literature stated that VR-JIT systems did not use drawn virtual humans but videos of real people instead8. It would therefore be appropriate to reach a consensus on the conceptual definition of a virtual human and on whether and how it differs from other terms such as virtual agent or avatar. In this paper, a potential definition of a virtual human is proposed.
The results showed that most of the studies chose to sensorize the developed system to monitor behavioral responses, facial expressions, head orientation, semantics, prosody, and eye movements10-19. This is consistent with the investigated ability, mostly social skills, and with the type of responsiveness of the virtual human. System sensorization, which enables the responsive adaptation of virtual humans, relied on real-time techniques in six articles. In the rest of the articles, by contrast, the virtual human did not adapt its response in real time to the user’s biometric inputs. Real-time adaptation allowed response patterns to vary based on the (behavioral or physical) input data acquired, and to return individualized output in real time4. The distinctive feature of virtual humans lay in their ability to modify the information they transmit based on the user’s recorded data. The form of virtual human responsivity varied among the selected articles depending on the acquired information. In this regard, by modifying their own facial, head, and gaze characteristics, such virtual humans were able to send prompts to the user10,11,13,16,16,20. For example, one study trained joint attention in ASD individuals through a game-based system in which the virtual human adapted its gaze and head orientation with the aim of improving the user’s performance13. Another example was “Zeca”, a virtual human used with the goal of having the user copy the virtual human’s facial expressions11. The basic idea was that, by learning through imitation, ASD individuals could train facial movements and expressions and learn to identify emotions.
Regarding the interaction module category, the virtual humans presented in the articles can be divided into three groups: 1) those that did not present conversational content11,13,19,20, 2) those that implemented predetermined sentences reproduced automatically regardless of the user’s input10,12,15,16, and 3) those that were programmed via artificial intelligence (AI) and were able to modulate their speech according to the training data (i.e., the response-generation method)14,17,18. The latter approach is the most sophisticated of those presented in this paper. Response generation refers to a dialogue-oriented chat that captures implicit and/or explicit user data with the aim of generating a coherent response through probability calculations9. Response generation is closely related to natural language processing (NLP), the field of AI that deals with human language processing. For example, response generation can use NLP to analyze the structure and meaning of the sentences in the input and use this information to generate a coherent and relevant response. In short, response generation depends on NLP to function properly. Technologies that use machine learning algorithms to generate consistent responses include chatbots, virtual assistance systems, and automatic translation systems.
Future research should implement more sophisticated language models, such as the Generative Pre-trained Transformer (GPT). GPT is a language model developed by OpenAI that uses response generation: it employs deep learning techniques to analyze human language, parse the input, and generate an appropriate response. Besides being more modern and stable, GPT, and particularly GPT-3, the latest version at the time of writing, offers a high sense of realism in the natural language of the virtual agent. A further option would be the Megatron-Turing Natural Language Generation model (MT-NLG), but it is not as openly accessible as GPT-3. Implementing this type of software in a virtual human presented in VR would enhance the sense of realism and ecological validity in the treatment and diagnosis of ASD.
TAKE-HOME MESSAGE
Current knowledge
1. Responsive virtual humans adapt their output in real time according to the user input gathered.
2. Two language-based modalities have been mostly used to date: the response-retrieval and response-generation methods.
Contribution of the article
1. Overview of empirical evidence for the use of conversational responsive agents in ASD research.
2. Future directions within this research area are outlined in light of recent scientific advances.