Reference Publications

Action Units

Pantic, M., & Rothkrantz, L. J. M. 2000. Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1424-1445.

  Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, development of an automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression (e.g., in emotion categories). A system that performs these operations accurately and in real time would form a big step in achieving a human-like interaction between man and machine. The paper surveys the past work in solving these problems. The capability of the human visual system with respect to these problems is discussed, too. It is meant to serve as an ultimate goal and a guide for determining recommendations for development of an automatic facial expression analyzer.
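
  The survey frames automatic expression analysis as three coupled problems: detecting the face in an image, extracting the expression information, and classifying the expression. As a rough illustration only (not a system from the survey), the sketch below wires those stages together with an OpenCV Haar cascade, raw-pixel features, and a placeholder scikit-learn SVM; the function names, feature choice, and parameters are all hypothetical.

    # Minimal sketch of the three-stage pipeline discussed in the survey:
    # (1) face detection, (2) feature extraction, (3) expression classification.
    # Illustrative only; the cascade, features, and classifier are stand-ins.
    import cv2
    import numpy as np
    from sklearn.svm import SVC

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def extract_features(gray, box, size=(48, 48)):
        """Crop the detected face and use normalized pixels as a simple feature vector."""
        x, y, w, h = box
        face = cv2.resize(gray[y:y + h, x:x + w], size)
        return face.astype(np.float32).ravel() / 255.0

    def classify_expressions(image_bgr, classifier):
        """Detect faces, extract features, and predict an expression label per face."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        feats = [extract_features(gray, box) for box in boxes]
        if not feats:
            return []
        return classifier.predict(np.vstack(feats))  # e.g., emotion category labels

    # classifier = SVC().fit(train_features, train_labels)  # trained elsewhere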

Ekman, P., & Friesen, W. V. 1978. Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.

  BOOK

Hager, J. C., Ekman, P., & Friesen, W. V. 2002. Facial action coding system. Salt Lake City, UT: A Human Face.

  BOOK

Oster, H. 2003. Emotion in the infant's face. Annals of the New York Academy of Sciences, 1000(1), 197-204.

  Darwin viewed “experiments in nature” as an important strategy for elucidating the evolutionary bases of human emotional expressions. Infants with craniofacial anomalies are of special interest because morphological abnormalities and resulting distortions or deficits in their facial expressions could make it more difficult for caregivers to read and accurately interpret their signals. As part of a larger study on the effects of craniofacial anomalies on infant facial expression and parent-infant interaction, infants with different types of craniofacial conditions and comparison infants were videotaped interacting with their mothers at 3 and 6 months. The infants' facial expressions were coded with Baby FACS. Thirty-seven slides of 16 infants displaying 4 distinctive infant expressions (cry face, negative face, interest, and smile) were rated by 38 naive observers on a 7-point scale ranging from intense distress to intense happiness. Their ratings were significantly correlated with ratings based on objective Baby FACS criteria (r > 0.9 in all infant groups). A 4 (infant group) × 4 (expression category) ANOVA showed a significant main effect for expression category, F(3) = 71.9, P = 0.000, but no significant effect for infant group or the group × expression interaction. The observers' ratings were thus highly “accurate” in terms of a priori Baby FACS criteria, even in the case of infants with severely disfiguring facial conditions. These findings demonstrate that the signal value of infant facial expressions is remarkably robust, suggesting that the capacity to read emotional meaning in infants' facial expressions may have a biological basis.

Ortony, A., & Turner, T. J. 1990. What's basic about basic emotions? Psychological Review, 97(3), 315.

  A widespread assumption in theories of emotion is that there exists a small set of basic emotions. From a biological perspective, this idea is manifested in the belief that there might be neurophysiological and anatomical substrates corresponding to the basic emotions. From a psychological perspective, basic emotions are often held to be the primitive building blocks of other, nonbasic emotions. The content of such claims is examined, and the results suggest that there is no coherent nontrivial notion of basic emotions as the elementary psychological primitives in terms of which other emotions can be explained. Thus, the view that there exist basic emotions out of which all other emotions are built, and in terms of which they can be explained, is questioned, raising the possibility that this position is an article of faith rather than an empirically or theoretically defensible basis for the conduct of emotion research. This suggests that perhaps the notion of basic emotions will not lead to significant progress in the field. An alternative approach to explaining the phenomena that appear to motivate the postulation of basic emotions is presented.

Ekman, P., & Friesen, W. V. 1975. Unmasking the face: A guide to recognizing emotions from facial cues. Englewood Cliffs, NJ: Prentice-Hall.

  BOOK

Parrott, W. G. (Ed.). 2001. Emotions in social psychology: Essential readings. Psychology Press.

  BOOK

Baron-Cohen, S., Golan, O., Wheelwright, S., & Hill, J. 2004. Mind reading: The interactive guide to emotions. London: Jessica Kingsley.

  BOOK

Reed, L. I., Sayette, M. A., & Cohn, J. F. 2007. Impact of depression on response to comedy: A dynamic facial coding analysis. Journal of Abnormal Psychology, 116(4), 804.

  Individuals suffering from depression show diminished facial responses to positive stimuli. Recent cognitive research suggests that depressed individuals may appraise emotional stimuli differently than do nondepressed persons. Prior studies do not indicate whether depressed individuals respond differently when they encounter positive stimuli that are difficult to avoid. The authors investigated dynamic responses of individuals varying in both history of major depressive disorder (MDD) and current depressive symptomatology (N = 116) to robust positive stimuli. The Facial Action Coding System (Ekman & Friesen, 1978) was used to measure affect-related responses to a comedy clip. Participants reporting current depressive symptomatology were more likely to evince affect-related shifts in expression following the clip than were those without current symptomatology. This effect of current symptomatology emerged even when the contrast focused only on individuals with a history of MDD. Specifically, persons with current depressive symptomatology were more likely than those without current symptomatology to control their initial smiles with negative affect-related expressions. These findings suggest that integration of emotion science and social cognition may yield important advances for understanding depression.

Lints-Martindale, A. C., Hadjistavropoulos, T., Barber, B., & Gibson, S. J. 2007. A psychophysical investigation of the facial action coding system as an index of pain variability among older adults with and without Alzheimer’s disease. Pain Medicine, 8(8), 678-689.

  Objective. Reflexive responses to pain such as facial reactions become increasingly important for pain assessment among patients with Alzheimer’s disease (AD) because self-report capabilities diminish as cognitive abilities decline. Our goal was to study facial expressions of pain in patients with and without AD. Design. We employed a quasi-experimental design and used the Facial Action Coding System (FACS) to assess reflexive facial responses to noxious stimuli of varied intensity. Two different modalities of stimulation (mechanical and electrical) were employed. Results. The FACS identified differences in facial expression as a function of level of discomforting stimulation. As expected, there were no significant differences based on disease status (AD vs control group). Conclusions. This is the first study to discriminate among FACS measures collected during innocuous and graded levels of precisely measured painful stimuli in seniors with (mild) dementia and in healthy control group participants. We conclude that, as hypothesized, FACS can be used for the assessment of evoked pain, regardless of the presence of AD.

Sebe, N. 2005. Machine learning in computer vision (Vol. 29). Springer Science & Business Media.

  BOOK

Bettadapura, V. 2012. Face expression recognition and analysis: the state of the art. arXiv preprint arXiv:1203.6722.

  The automatic recognition of facial expressions has been an active research topic since the early nineties. There have been several advances in the past few years in terms of face detection and tracking, feature extraction mechanisms and the techniques used for expression classification. This paper surveys some of the published work since 2001 till date. The paper presents a time-line view of the advances made in this field, the applications of automatic face expression recognizers, the characteristics of an ideal system, the databases that have been used and the advances made in terms of their standardization and a detailed summary of the state of the art. The paper also discusses facial parameterization using FACS Action Units (AUs) and MPEG-4 Facial Animation Parameters (FAPs) and the recent advances in face detection, tracking and feature extraction methods. Notes have also been presented on emotions, expressions and facial features, discussion on the six prototypic expressions and the recent studies on expression classifiers. The paper ends with a note on the challenges and the future work. This paper has been written in a tutorial style with the intention of helping students and researchers who are new to this field.

Huang, D., & De la Torre, F. 2012. Facial action transfer with personalized bilinear regression. In Computer Vision – ECCV 2012 (pp. 144-158). Springer Berlin Heidelberg.

  Facial Action Transfer (FAT) has recently attracted much attention in computer vision due to its diverse applications in the movie industry, computer games, and privacy protection. The goal of FAT is to “clone” the facial actions from the videos of one person (source) to another person (target). In this paper, we will assume that we have a video of the source person but only one frontal image of the target person. Most successful methods for FAT require a training set with annotated correspondence between expressions of different subjects, sometimes including many images of the target subject. However, labeling expressions is time consuming and error prone (i.e., it is difficult to capture the same intensity of the expression across people). Moreover, in many applications it is not realistic to have many labeled images of the target. This paper proposes a method to learn a personalized facial model, that can produce photo-realistic person-specific facial actions (e.g., synthesize wrinkles for smiling), from only a neutral image of the target person. More importantly, our learning method does not need an explicit correspondence of expressions across subjects. Experiments on the Cohn-Kanade and the RU-FACS databases show the effectiveness of our approach to generate video-realistic images of the target person driven by spontaneous facial actions of the source. Moreover, we illustrate applications of FAT to face de-identification.
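
  To make the transfer idea concrete, the toy sketch below applies the crudest possible baseline: copying a source subject's landmark displacement onto a target's neutral shape. It is only meant to illustrate what "cloning a facial action" means geometrically; the paper's personalized bilinear regression additionally synthesizes photo-realistic, person-specific appearance, which this sketch does not attempt.

    # Toy illustration of facial action transfer as pure landmark geometry.
    # Not the paper's bilinear regression; all inputs are hypothetical arrays.
    import numpy as np

    def transfer_action(source_neutral, source_expr, target_neutral):
        """All inputs are (n_landmarks, 2) arrays of facial landmark coordinates."""
        displacement = source_expr - source_neutral   # the source facial action
        return target_neutral + displacement          # apply it to the target shape

    # Hypothetical usage with 68-point landmark arrays:
    # target_expr = transfer_action(src_neutral, src_smile, tgt_neutral)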

Zeng, J., et al. 2015. Confidence preserving machine for facial action unit detection. In Proceedings of the IEEE International Conference on Computer Vision.

  Varied sources of error contribute to the challenge of facial action unit detection. Previous approaches address specific and known sources. However, many sources are unknown. To address the ubiquity of error, we propose a Confidence Preserving Machine (CPM) that follows an easy-to-hard classification strategy. During training, CPM learns two confident classifiers. A confident positive classifier separates easily identified positive samples from all else; a confident negative classifier does the same for negative samples. During testing, CPM then learns a person-specific classifier using “virtual labels” provided by the confident classifiers. This step is achieved using a quasi-semi-supervised (QSS) approach. Hard samples are typically close to the decision boundary, and the QSS approach disambiguates them using spatio-temporal constraints. To evaluate CPM, we compared it with a baseline single-margin classifier and state-of-the-art semi-supervised learning, transfer learning, and boosting methods in three datasets of spontaneous facial behavior. With few exceptions, CPM outperformed the baseline and state-of-the-art methods.
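
  A minimal sketch of the easy-to-hard idea, under simplifying assumptions: a single generic detector with two confidence thresholds stands in for the paper's confident positive and confident negative classifiers, and the resulting virtual labels train a person-specific classifier for the remaining hard frames. The thresholds, classifiers, and function names are illustrative, and the paper's spatio-temporal QSS step is omitted.

    # Rough sketch of "confident classifiers assign virtual labels to easy test
    # frames, then a person-specific classifier decides the hard ones."
    # Illustrative stand-in, not the paper's CPM formulation.
    import numpy as np
    from sklearn.svm import LinearSVC

    def cpm_like_predict(X_train, y_train, X_test, pos_margin=1.0, neg_margin=-1.0):
        generic = LinearSVC().fit(X_train, y_train)       # generic AU detector
        scores = generic.decision_function(X_test)

        easy = (scores >= pos_margin) | (scores <= neg_margin)
        virtual_labels = (scores >= pos_margin).astype(int)  # confident pseudo-labels

        y_pred = np.where(scores >= 0, 1, 0)              # fallback: generic decision
        if easy.sum() >= 2 and len(np.unique(virtual_labels[easy])) == 2:
            personal = LinearSVC().fit(X_test[easy], virtual_labels[easy])
            hard = ~easy
            y_pred[hard] = personal.predict(X_test[hard])  # refine only the hard frames
            y_pred[easy] = virtual_labels[easy]
        return y_pred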

Ding, X., Chu, W. S., De la Torre, F., Cohn, J. F., & Wang, Q. 2013. Facial action unit event detection by cascade of tasks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2400-2407). IEEE.

  Automatic facial Action Unit (AU) detection from video is a long-standing problem in facial expression analysis. AU detection is typically posed as a classification problem between frames or segments of positive examples and negative ones, where existing work emphasizes the use of different features or classifiers. In this paper, we propose a method called Cascade of Tasks (CoT) that combines the use of different tasks (i.e., frame, segment and transition) for AU event detection. We train CoT in a sequential manner embracing diversity, which ensures robustness and generalization to unseen data. In addition to conventional frame-based metrics that evaluate frames independently, we propose a new event-based metric to evaluate detection performance at the event level. We show how the CoT method consistently outperforms state-of-the-art approaches in both frame-based and event-based metrics, across three public datasets that differ in complexity: CK+, FERA and RU-FACS.
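
  The sketch below illustrates, in simplified form, the shift from frame-level scores to event-level output: consecutive positive frames are grouped into candidate segments and a segment-level check prunes weak ones. The actual Cascade of Tasks trains separate frame, segment, and transition classifiers; the thresholds and function names here are made up.

    # Illustrative conversion of per-frame AU detector scores into AU events
    # (start, end). A simplification of the paper's frame/segment/transition cascade.
    import numpy as np

    def frames_to_events(frame_scores, frame_thresh=0.0, min_mean_score=0.5):
        """frame_scores: 1-D array of per-frame AU detector scores."""
        frame_scores = np.asarray(frame_scores, dtype=float)
        active = frame_scores > frame_thresh
        events, start = [], None
        for t, on in enumerate(active):
            if on and start is None:
                start = t
            elif not on and start is not None:
                events.append((start, t - 1))
                start = None
        if start is not None:
            events.append((start, len(active) - 1))
        # Segment-level check: keep only events whose mean frame score is high enough.
        return [(s, e) for s, e in events if frame_scores[s:e + 1].mean() >= min_mean_score]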

Wang, Z., et al. 2013. Capturing global semantic relationships for facial action unit recognition. In Proceedings of the IEEE International Conference on Computer Vision.

  In this paper we tackle the problem of facial action unit (AU) recognition by exploiting the complex semantic relationships among AUs, which carry crucial top-down information yet have not been thoroughly exploited. Towards this goal, we build a hierarchical model that combines the bottom-level image features and the top-level AU relationships to jointly recognize AUs in a principled manner. The proposed model has two major advantages over existing methods. 1) Unlike methods that can only capture local pair-wise AU dependencies, our model is developed upon the restricted Boltzmann machine and therefore can exploit the global relationships among AUs. 2) Although AU relationships are influenced by many related factors such as facial expressions, these factors are generally ignored by the current methods. Our model, however, can successfully capture them to more accurately characterize the AU relationships. Efficient learning and inference algorithms of the proposed model are also developed. Experimental results on benchmark databases demonstrate the effectiveness of the proposed approach in modelling complex AU relationships as well as its superior AU recognition performance over existing approaches.
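
  As a small, self-contained illustration of the underlying idea (not the paper's full hierarchical model, which also couples in image features), the sketch below fits a scikit-learn BernoulliRBM to toy binary AU label vectors in which two AUs co-occur, then draws a Gibbs step to show that the learned model encodes the dependency.

    # Toy demonstration: a restricted Boltzmann machine fit to binary AU label
    # vectors captures their co-occurrence structure. Synthetic data only.
    import numpy as np
    from sklearn.neural_network import BernoulliRBM

    rng = np.random.RandomState(0)
    # Toy data: 500 frames x 5 AUs, where AU0 and AU1 tend to co-occur.
    base = rng.rand(500, 1) > 0.5
    aus = np.hstack([base, base, rng.rand(500, 3) > 0.8]).astype(float)

    rbm = BernoulliRBM(n_components=4, learning_rate=0.05, n_iter=200,
                       random_state=0).fit(aus)

    # One Gibbs step from a frame with only AU0 active: the sampled vector is
    # more likely to switch AU1 on as well, reflecting the learned dependency.
    probe = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])
    print(rbm.gibbs(probe))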

Kaltwang, S., Todorovic, S., & Pantic, M. 2015. Latent trees for estimating intensity of facial action units. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  This paper is about estimating intensity levels of Facial Action Units (FAUs) in videos as an important and challenging step toward interpreting facial expressions. To address uncertainty in detections of facial landmark points, used as our input features, we formulate a new generative framework comprised of a graphical model, inference, and algorithms for learning both model parameters and structure. Our model is a latent tree (LT) that represents input features of facial landmark points and FAU intensities as leaf nodes, and encodes their higher-order dependencies with latent nodes at tree levels closer to the root. No other restrictions are placed on the model structure beyond that it is a tree. We specify a new algorithm for efficient learning of model structure that iteratively builds the LT by adding either a new edge or a new hidden node, whichever of these two graph-edit operations gives the highest increase of the joint likelihood. Our structure learning efficiently computes the likelihood increase and selects an optimal graph revision without considering all possible structural changes. For FAU intensity estimation, we derive closed-form expressions of posterior marginals of all variables in the LT, and specify an efficient inference in two passes -- bottom-up and top-down. Our evaluation on the benchmark DISFA and ShoulderPain datasets, in a subject-independent setting, demonstrates that we outperform the state of the art, even in the presence of significant noise in locations of facial landmark points. We demonstrate our correct learning of model structure by probabilistically sampling facial landmark points, conditioned on a given FAU intensity, and thus generating plausible facial expressions.
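
  The toy example below shows what "posterior marginals in a latent tree" means on the smallest possible case: one binary latent node with three leaves (two binary landmark-like features and a three-level AU intensity). The marginal is computed by brute-force enumeration rather than the paper's efficient bottom-up/top-down passes, and every probability table is invented for illustration.

    # Tiny latent-tree illustration: latent H with leaves F1, F2 (binary features)
    # and A (3-level AU intensity). Posterior of A given the features, by enumeration.
    import numpy as np

    p_h = np.array([0.6, 0.4])                      # P(H)
    p_f_given_h = np.array([[0.9, 0.2],             # P(F1=1|H=0), P(F2=1|H=0)
                            [0.1, 0.8]])            # P(F1=1|H=1), P(F2=1|H=1)
    p_a_given_h = np.array([[0.7, 0.2, 0.1],        # P(A|H=0)
                            [0.1, 0.3, 0.6]])       # P(A|H=1)

    def posterior_au(f1, f2):
        """P(A | F1=f1, F2=f2), marginalizing over the latent node H."""
        joint = np.zeros(3)
        for h in (0, 1):
            lik_f1 = p_f_given_h[h, 0] if f1 else 1 - p_f_given_h[h, 0]
            lik_f2 = p_f_given_h[h, 1] if f2 else 1 - p_f_given_h[h, 1]
            joint += p_h[h] * lik_f1 * lik_f2 * p_a_given_h[h]
        return joint / joint.sum()

    print(posterior_au(1, 1))   # AU intensity distribution when both features fire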

Zhao, K., et al. 2015. Joint patch and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  The face is one of the most powerful channels of non-verbal communication. The most commonly used taxonomy to describe facial behaviour is the Facial Action Coding System (FACS). FACS segments the visible effects of facial muscle activation into 30+ action units (AUs). AUs, which may occur alone and in thousands of combinations, can describe nearly all possible facial expressions. Most existing methods for automatic AU detection treat the problem using one-vs-all classifiers and fail to exploit dependencies among AUs and facial features. We introduce joint patch and multi-label learning (JPML) to address these issues. JPML leverages group sparsity by selecting a sparse subset of facial patches while learning a multi-label classifier. In four of five comparisons on three diverse datasets, CK+, GFT, and BP4D, JPML produced the highest average F1 scores in comparison with the state of the art.
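
  As a loose analogy only, the sketch below uses scikit-learn's MultiTaskLasso, which selects one shared sparse subset of input features across several outputs, to mimic the spirit of JPML's joint patch selection for multiple AUs. JPML's actual group-sparse multi-label classifier is a different formulation, and all data here are synthetic.

    # Loose analogy to joint patch selection: MultiTaskLasso picks a shared
    # sparse feature subset for several AU "labels" at once. Synthetic toy data.
    import numpy as np
    from sklearn.linear_model import MultiTaskLasso

    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)                    # 200 frames, 50 patch features (toy)
    true_patches = [3, 7, 19]                 # only a few patches matter
    W = np.zeros((50, 4))                     # 4 AU "labels" share those patches
    W[true_patches, :] = rng.randn(len(true_patches), 4)
    Y = X @ W + 0.1 * rng.randn(200, 4)

    model = MultiTaskLasso(alpha=0.1).fit(X, Y)
    selected = np.flatnonzero(np.abs(model.coef_).sum(axis=0) > 1e-6)
    print("patches selected jointly for all AUs:", selected)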

Chu, W. S., De la Torre, F., & Cohn, J. F. 2013. Selective transfer machine for personalized facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3515-3522). IEEE.

  Automatic facial action unit (AFA) detection from video is a long-standing problem in facial expression analysis. Most approaches emphasize choices of features and classifiers. They neglect individual differences in target persons. People vary markedly in facial morphology (e.g., heavy versus delicate brows, smooth versus deeply etched wrinkles) and behavior. Individual differences can dramatically influence how well generic classifiers generalize to previously unseen persons. While a possible solution would be to train person-specific classifiers, that often is neither feasible nor theoretically compelling. The alternative that we propose is to personalize a generic classifier in an unsupervised manner (no additional labels for the test subjects are required). We introduce a transductive learning method, which we refer to as Selective Transfer Machine (STM), to personalize a generic classifier by attenuating person-specific biases. STM achieves this effect by simultaneously learning a classifier and re-weighting the training samples that are most relevant to the test subject. To evaluate the effectiveness of STM, we compared STM to generic classifiers and to cross-domain learning methods in three major databases: CK+, GEMEP-FERA and RU-FACS. STM outperformed generic classifiers in all.
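
  A rough sketch of the personalization idea under simplifying assumptions: training frames are re-weighted by their similarity to the unlabeled test subject and a weighted SVM is trained on them. STM instead learns the weights and the classifier jointly (via a distribution-matching term inside the SVM objective); the distance-based weights and bandwidth below are crude stand-ins.

    # Illustrative personalization by sample re-weighting; not the STM objective.
    import numpy as np
    from sklearn.svm import SVC

    def personalized_classifier(X_train, y_train, X_test, bandwidth=1.0):
        test_mean = X_test.mean(axis=0)                        # summary of test subject
        dists = np.linalg.norm(X_train - test_mean, axis=1)
        weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))  # closer frames count more
        clf = SVC(kernel="linear")
        clf.fit(X_train, y_train, sample_weight=weights)
        return clf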

Valstar, M., & Pantic, M. 2006. Fully automatic facial action unit detection and temporal analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'06) (p. 149). IEEE.

  In this work we report on the progress of building a system that enables fully automated, fast and robust facial expression recognition from face video. We analyse subtle changes in facial expression by recognizing facial muscle action units (AUs) and analysing their temporal behavior. By detecting AUs from face video we enable the analysis of various facial communicative signals including facial expressions of emotion, attitude and mood. For an input video picturing a facial expression we detect per frame whether any of 15 different AUs is activated, whether that facial action is in the onset, apex, or offset phase, and what the total duration of the activation in question is. We base this process upon a set of spatio-temporal features calculated from tracking data for 20 facial fiducial points. To detect these 20 points of interest in the first frame of an input face video, we utilize a fully automatic, facial point localization method that uses individual feature GentleBoost templates built from Gabor wavelet features. Then, we exploit a particle filtering scheme that uses factorized likelihoods and a novel observation model that combines a rigid and a morphological model to track the facial points. The AUs displayed in the input video and their temporal segments are recognized finally by Support Vector Machines trained on a subset of the most informative spatio-temporal features selected by AdaBoost. For the Cohn-Kanade and MMI databases, the proposed system classifies 15 AUs occurring alone or in combination with other AUs with a mean agreement rate of 90.2% with human FACS coders.
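
  To illustrate the notion of temporal segments, the sketch below labels onset, apex, and offset phases from a smoothed AU intensity signal using simple thresholds and the sign of the derivative. This is only a caricature of the paper's approach, which classifies phases with SVMs over spatio-temporal features of twenty tracked facial points; the thresholds and names are arbitrary.

    # Toy labeling of onset/apex/offset phases for a single AU activation,
    # from an intensity time series. Illustrative thresholds only.
    import numpy as np

    def temporal_phases(intensity, active_thresh=0.2, apex_frac=0.9):
        """intensity: 1-D array of AU intensity over time, for a single activation."""
        intensity = np.asarray(intensity, dtype=float)
        phases = np.full(len(intensity), "neutral", dtype=object)
        active = intensity > active_thresh
        if not active.any():
            return phases
        peak = intensity.max()
        rising = np.gradient(intensity) > 0
        phases[active & (intensity >= apex_frac * peak)] = "apex"      # plateau at peak
        phases[active & (phases == "neutral") & rising] = "onset"      # rising frames
        phases[active & (phases == "neutral") & ~rising] = "offset"    # falling frames
        return phases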

Simon, T., Nguyen, M. H., De la Torre, F., & Cohn, J. F. 2010. Action unit detection with segment-based SVMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2737-2744). IEEE.

  Automatic facial action unit (AU) detection from video is a long-standing problem in computer vision. Two main approaches have been pursued: (1) static modeling — typically posed as a discriminative classification problem in which each video frame is evaluated independently; (2) temporal modeling — frames are segmented into sequences and typically modeled with a variant of dynamic Bayesian networks. We propose a segment-based approach, kSeg-SVM, that incorporates benefits of both approaches and avoids their limitations. kSeg-SVM is a temporal extension of the spatial bag-of-words. kSeg-SVM is trained within a structured output SVM framework that formulates AU detection as a problem of detecting temporal events in a time series of visual features. Each segment is modeled by a variant of the BoW representation with soft assignment of the words based on similarity. Our framework has several benefits for AU detection: (1) both dependencies between features and the length of action units are modeled; (2) all possible segments of the video may be used for training; and (3) no assumptions are required about the underlying structure of the action unit events (e.g., i.i.d.). Our algorithm finds the best k-or-fewer segments that maximize the SVM score. Experimental results suggest that the proposed method outperforms state-of-the-art static methods for AU detection.
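
  The sketch below conveys the "best k-or-fewer segments" idea in a heavily simplified way: given per-frame SVM scores, it greedily takes the highest-scoring contiguous run, masks it out, and repeats up to k times. kSeg-SVM instead searches over segments inside a structured-output SVM with a bag-of-temporal-words segment model, so this greedy post-processing is only illustrative.

    # Greedy selection of up to k non-overlapping high-scoring segments from
    # per-frame scores. A simplification, not the paper's structured-output search.
    import numpy as np

    def best_subarray(scores):
        """Maximum-sum contiguous run (Kadane's algorithm), returning (sum, (start, end))."""
        best_sum, best_span = -np.inf, (0, 0)
        cur_sum, cur_start = 0.0, 0
        for i, s in enumerate(scores):
            if cur_sum <= 0:
                cur_sum, cur_start = s, i
            else:
                cur_sum += s
            if cur_sum > best_sum:
                best_sum, best_span = cur_sum, (cur_start, i)
        return best_sum, best_span

    def top_k_segments(frame_scores, k=3):
        scores = np.asarray(frame_scores, dtype=float).copy()
        segments = []
        for _ in range(k):
            total, (start, end) = best_subarray(scores)
            if total <= 0:
                break                        # remaining frames only lower the score
            segments.append((start, end))
            scores[start:end + 1] = -np.inf  # mask so segments never overlap
        return segments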