Automatic Facial Expression Recognition and Analysis, in particular FACS Action Unit (AU) detection and discrete emotion detection, has been an active topic in computer science for over two decades. Standardisation and comparability have come some way; for instance, there exist a number of commonly used facial expression databases. However, lack of a common evaluation protocol and lack of sufficient details to reproduce the reported individual results make it difficult to compare systems to each other. This in turn hinders the progress of the field. A periodical challenge in Facial Expression Recognition and Analysis would allow this comparison in a fair manner. It would clarify how far the field has come, and would allow us to identify new goals, challenges and targets. In this paper we present the first challenge in automatic recognition of facial expressions to be held during the IEEE conference on Face and Gesture Recognition 2011, in Santa Barbara, California. Two sub-challenges are defined: one on AU detection and another on discrete emotion detection. The paper outlines the evaluation protocol, the data used, and the results of a baseline method for the two sub-challenges.
Neutral Face Estimation
Emotion recognition is a very active field of research. The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2013 Grand Challenge consists of an audio-video based emotion classification challenge, which mimics real-world conditions. Traditionally, emotion recognition has been performed on laboratory-controlled data. While undoubtedly worthwhile at the time, such laboratory-controlled data poorly represents the environment and conditions faced in real-world situations. The goal of this Grand Challenge is to define a common platform for evaluation of emotion recognition methods in real-world conditions. The database in the 2013 challenge is the Acted Facial Expression in the Wild (AFEW), which has been collected from movies showing close-to-real-world conditions.
In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature length movies. This involves the analysis of video clips of acted scenes lasting approximately one to two seconds, including the audio track which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of face images harvested from Google image search, followed by a per-frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top performing models into a single predictor we were able to produce an accuracy of 41.03% on the challenge test set. These compare favorably to the challenge baseline test set accuracy of 27.56%.
In this paper, we propose a method for video-based human emotion recognition. For each video clip, all frames are represented as an image set, which can be modeled as a linear subspace to be embedded in a Grassmannian manifold. After feature extraction, Class-specific One-to-Rest Partial Least Squares (PLS) is learned on video and audio data respectively to distinguish each class from the other confusing ones. Finally, an optimal fusion of classifiers learned from both modalities (video and audio) is conducted at decision level. Our method is evaluated on the Emotion Recognition In The Wild Challenge (EmotiW 2013). The experimental results on both the validation set and the blind test set are presented for comparison. The final accuracy achieved on the test set outperforms the baseline by 26%.
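As a rough illustration of the image-set modeling described in the abstract above, the following sketch represents each clip's frame features by an orthonormal basis of a linear subspace (a point on a Grassmann manifold) and compares two clips with the projection kernel. The feature dimension, the subspace rank, the random data and the choice of the projection kernel are assumptions made for illustration only; the abstract does not specify them.

```python
import numpy as np


def subspace_basis(frames, rank):
    """Return an orthonormal basis (d x rank) for the leading directions of an image set.

    frames: array of shape (n_frames, d), one feature vector per frame.
    """
    # The left singular vectors of the d x n data matrix span the set's subspace.
    u, _, _ = np.linalg.svd(frames.T, full_matrices=False)
    return u[:, :rank]


def projection_kernel(basis_a, basis_b):
    """Projection kernel ||A^T B||_F^2, a common similarity between two subspaces."""
    return float(np.linalg.norm(basis_a.T @ basis_b, ord="fro") ** 2)


# Hypothetical data: two clips with 30 and 25 frames of 64-dimensional features.
rng = np.random.default_rng(0)
clip_a = rng.normal(size=(30, 64))
clip_b = rng.normal(size=(25, 64))

rank = 5  # assumed subspace dimension
basis_a = subspace_basis(clip_a, rank)
basis_b = subspace_basis(clip_b, rank)

self_similarity = projection_kernel(basis_a, basis_a)   # equals the rank
cross_similarity = projection_kernel(basis_a, basis_b)  # lies between 0 and the rank
```

A kernel of this kind could then feed a kernel-based classifier; how the original method uses the manifold embedding downstream is not stated in the abstract.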
Automatic facial point detection plays arguably the most important role in face analysis. Several methods have been proposed which reported their results on databases of both constrained and unconstrained conditions. Most of these databases provide annotations with different mark-ups and in some cases there are problems related to the accuracy of the fiducial points. The aforementioned issues, as well as the lack of an evaluation protocol, make it difficult to compare performance between different systems. In this paper, we present the 300 Faces in-the-Wild Challenge: the first facial landmark localization challenge, held in conjunction with the International Conference on Computer Vision 2013, Sydney, Australia. The main goal of this challenge is to compare the performance of different methods on a newly collected dataset using the same evaluation protocol and the same mark-up and hence to develop the first standardized benchmark for facial landmark localization.
Non-frontal view facial expression recognition is important in many scenarios where the frontal view face images may not be available. However, little work on this issue has been done in the past several years because of its technical challenges and the lack of appropriate databases. Recently, a 3D facial expression database (BU-3DFE database) was collected by Yin et al. and has attracted some researchers to study this issue. Based on the BU-3DFE database, in this paper we propose a novel approach to expression recognition from non-frontal view facial images. The novelty of the proposed method lies in recognizing the multi-view expressions under the unified Bayes theoretical framework, where the recognition problem can be formulated as an optimization problem of minimizing an upper bound of Bayes error. We also propose a closed-form solution method based on the power iteration approach and rank-one update (ROU) technique to find the optimal solutions of the proposed method. Extensive experiments on the BU-3DFE database with 100 subjects and 5 yaw rotation view angles demonstrate the effectiveness of our method.
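The abstract above mentions the power iteration approach as one ingredient of its solver. For readers unfamiliar with it, here is a minimal, generic sketch of power iteration for the dominant eigenpair of a small symmetric matrix. It is not the paper's solver: the rank-one update step and the Bayes-error objective are not reproduced, and the example matrix, iteration limit and tolerance are illustrative assumptions.

```python
import math


def power_iteration(matrix, iterations=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    matrix: list of rows (lists of floats). Returns (eigenvalue, unit eigenvector).
    """
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n  # arbitrary unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        # Multiply by the matrix and renormalise.
        product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        new_vector = [x / norm for x in product]
        # The Rayleigh quotient of the unit vector gives the eigenvalue estimate.
        image = [sum(row[j] * new_vector[j] for j in range(n)) for row in matrix]
        new_eigenvalue = sum(new_vector[i] * image[i] for i in range(n))
        converged = abs(new_eigenvalue - eigenvalue) < tol
        vector, eigenvalue = new_vector, new_eigenvalue
        if converged:
            break
    return eigenvalue, vector


# Illustrative symmetric matrix; its largest eigenvalue is (7 + sqrt(5)) / 2.
example = [[4.0, 1.0], [1.0, 3.0]]
dominant_value, dominant_vector = power_iteration(example)
```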
Most previous facial expression analysis works focused only on expression recognition. In this paper, we propose a novel framework of facial expression analysis based on the ranking model. Different from previous works, it can not only perform facial expression recognition but also estimate the intensity of facial expression, which is very important to further understand human emotion. Although it is hard to label expression intensity quantitatively, the ordinal relationship in the temporal domain is actually a good relative measurement. Based on this observation, we convert the problem of intensity estimation to a ranking problem, which is modeled by RankBoost. The output ranking score can be directly used for intensity estimation, and we also extend the ranking function for expression recognition. In order to further improve the performance, we propose to introduce l1-based regularization into RankBoost. Experiments on the Cohn-Kanade database show that the proposed method has a promising performance compared to the state of the art.
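The key observation in the abstract above, that temporal order can stand in for intensity labels, can be sketched as follows. The sketch assumes a sequence that runs from a neutral onset to the expression apex, so that a later frame should receive a higher ranking score than an earlier one; this monotonicity assumption, the function names and the example scores are illustrative and not taken from the paper. The RankBoost learner itself is not reproduced here.

```python
def temporal_pairs(num_frames):
    """All (earlier, later) frame index pairs; the later frame is assumed more intense."""
    return [(i, j) for i in range(num_frames) for j in range(i + 1, num_frames)]


def pairwise_accuracy(scores, pairs):
    """Fraction of pairs in which the later frame gets the strictly higher score."""
    correct = sum(1 for earlier, later in pairs if scores[later] > scores[earlier])
    return correct / len(pairs)


# Hypothetical ranking scores for a four-frame sequence (frame 2 is mis-ordered).
scores = [0.1, 0.3, 0.2, 0.8]
pairs = temporal_pairs(len(scores))
accuracy = pairwise_accuracy(scores, pairs)
```

A ranking learner would be trained to maximise agreement with such pairs; the resulting score can then be read as a relative intensity estimate.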
In this paper, two novel methods for facial expression recognition in facial image sequences are presented. The user has to manually place some of the Candide grid nodes on face landmarks depicted in the first frame of the image sequence under examination. The grid-tracking and deformation system used, based on deformable models, tracks the grid in consecutive video frames over time, as the facial expression evolves, until the frame that corresponds to the greatest facial expression intensity. The geometrical displacement of certain selected Candide nodes, defined as the difference of the node coordinates between the first and the greatest facial expression intensity frame, is used as an input to a novel multiclass Support Vector Machine (SVM) system of classifiers that are used to recognize either the six basic facial expressions or a set of chosen Facial Action Units (FAUs). The results on the Cohn-Kanade database show a recognition accuracy of 99.7% for facial expression recognition using the proposed multiclass SVMs and 95.1% for facial expression recognition based on FAU detection.
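The displacement feature defined in the abstract above, the difference of node coordinates between the first frame and the frame of greatest intensity, can be sketched as below. The node coordinates, the 2-D representation and the flattened feature layout are assumptions for illustration; the Candide grid tracker and the multiclass SVM are not reproduced.

```python
def displacement_features(first_frame_nodes, apex_frame_nodes):
    """Flatten per-node (dx, dy) displacements into a single feature vector.

    Both arguments are lists of (x, y) tuples for the same tracked nodes.
    """
    if len(first_frame_nodes) != len(apex_frame_nodes):
        raise ValueError("both frames must contain the same tracked nodes")
    features = []
    for (x_first, y_first), (x_apex, y_apex) in zip(first_frame_nodes, apex_frame_nodes):
        features.extend([x_apex - x_first, y_apex - y_first])
    return features


# Hypothetical positions of three tracked nodes in the first and apex frames.
first_frame = [(10.0, 20.0), (30.0, 20.0), (20.0, 40.0)]
apex_frame = [(9.0, 18.0), (31.0, 18.0), (20.0, 45.0)]
feature_vector = displacement_features(first_frame, apex_frame)
```

A vector of this form would be the input to whichever classifier is used, for example an SVM as in the abstract.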
In this paper, the effect of partial occlusion on facial expression recognition is investigated. The classification of partially occluded images into one of the six basic facial expressions is performed using a method based on Gabor wavelets texture information extraction, a supervised image decomposition method based on Discriminant Non-negative Matrix Factorization and a shape-based method that exploits the geometrical displacement of certain facial features. We demonstrate how partial occlusion affects the above-mentioned methods in the classification of the six basic facial expressions, and indicate the way partial occlusion affects human observers when recognizing facial expressions. An attempt is also made to specify which part of the face (left, right, lower or upper region) contains more discriminant information for each facial expression, and conclusions are drawn regarding the pairs of facial expression misclassifications that each type of occlusion introduces.
Enabling computer systems to recognize facial expressions and infer emotions from them in real time presents a challenging research topic. In this paper, we present a real-time approach to emotion recognition through facial expression in live video. We employ an automatic facial feature tracker to perform face localization and feature extraction. The facial feature displacements in the video stream are used as input to a Support Vector Machine classifier. We evaluate our method in terms of recognition accuracy for a variety of interaction and classification scenarios. Our person-dependent and person-independent experiments demonstrate the effectiveness of a support vector machine and feature tracking approach to fully automatic, unobtrusive expression recognition in live video. We conclude by discussing the relevance of our work to affective and intelligent man-machine interfaces and exploring further improvements.
We present a systematic comparison of machine learning methods applied to the problem of fully automatic recognition of facial expressions, including AdaBoost, support vector machines, and linear discriminant analysis. Each video-frame is first scanned in real-time to detect approximately upright-frontal faces. The faces found are scaled into image patches of equal size, convolved with a bank of Gabor energy filters, and then passed to a recognition engine that codes facial expressions into 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. We report results on a series of experiments comparing spatial frequency ranges, feature selection techniques, and recognition engines. Best results were obtained by selecting a subset of Gabor filters using AdaBoost and then training Support Vector Machines on the outputs of the filters selected by AdaBoost. The generalization performance to new subjects for a 7-way forced choice was 93% or more correct on two publicly available datasets, the best performance reported so far on these datasets. The outputs of the classifier change smoothly as a function of time and thus can be used for unobtrusive expression dynamics capture. We developed an end-to-end system that provides facial expression codes at 24 frames per second and animates a computer-generated character. In real-time this expression mirror operates down to resolutions of 16 pixels from eye to eye. We also applied the system to fully automated facial action coding.
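To make the notion of a Gabor energy filter in the abstract above concrete, the sketch below builds a quadrature pair of Gabor kernels (even and odd phase) and computes the energy as the sum of the two squared responses at a single patch location. The kernel size, wavelength, Gaussian width and the synthetic grating patch are illustrative assumptions; the paper's actual filter bank parameters and its full-image convolution are not reproduced.

```python
import numpy as np


def gabor_pair(size, wavelength, theta, sigma):
    """Return (even, odd) Gabor kernels of shape (size, size); size should be odd."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the filter's preferred orientation.
    rotated = xs * np.cos(theta) + ys * np.sin(theta)
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    phase = 2.0 * np.pi * rotated / wavelength
    return envelope * np.cos(phase), envelope * np.sin(phase)


def gabor_energy(patch, even, odd):
    """Energy = squared even response + squared odd response at the patch centre."""
    return float(np.sum(patch * even) ** 2 + np.sum(patch * odd) ** 2)


size, wavelength, sigma = 15, 6.0, 3.0  # assumed parameters
half = size // 2
_, grid_x = np.mgrid[-half:half + 1, -half:half + 1]
patch = np.cos(2.0 * np.pi * grid_x / wavelength)  # synthetic grating varying along x

aligned_energy = gabor_energy(patch, *gabor_pair(size, wavelength, 0.0, sigma))
orthogonal_energy = gabor_energy(patch, *gabor_pair(size, wavelength, np.pi / 2.0, sigma))
```

The filter tuned to the grating's orientation responds far more strongly than the orthogonal one, which is the property a bank of such filters exploits.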
Reliable detection of ordinary facial expressions (e.g. a smile) despite the variability among individuals as well as in face appearance is an important step toward the realization of a perceptual user interface with autonomous perception of persons. We describe a rule-based algorithm for robust facial expression recognition combined with robust face detection using a convolutional neural network. In this study, we address the problem of subject independence as well as translation, rotation, and scale invariance in the recognition of facial expression. The results show reliable detection of smiles with a recognition rate of 97.6% for 5600 still images of more than 10 subjects. The proposed algorithm demonstrated the ability to discriminate smiling from talking based on the saliency score obtained from voting visual cues. To the best of our knowledge, it is the first facial expression recognition model with the property of subject independence combined with robustness to variability in facial appearance.
The video analysis system described in this paper aims at facial expression recognition consistent with the MPEG4 standardized parameters for facial animation, FAP. For this reason, two levels of analysis are necessary: low-level analysis to extract the MPEG4 compliant parameters and high-level analysis to estimate the expression of the sequence using these low-level parameters. The low-level analysis is based on an improved active contour algorithm that uses high level information based on principal component analysis to locate the most significant contours of the face (eyebrows and mouth), and on motion estimation to track them. The high-level analysis takes as input the FAP produced by the low-level analysis tool and, by means of a Hidden Markov Model classifier, detects the expression of the sequence.
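The high-level stage described in the abstract above can be sketched as maximum-likelihood classification with one Hidden Markov Model per expression, using the forward algorithm to score a sequence. The sketch assumes that the FAP stream has already been quantised into discrete symbols, and every model parameter and label below is hypothetical; the abstract does not state how observations are modeled.

```python
def sequence_likelihood(observations, start, transition, emission):
    """Forward algorithm: probability of a discrete observation sequence under an HMM."""
    num_states = len(start)
    alpha = [start[s] * emission[s][observations[0]] for s in range(num_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(num_states)) * emission[s][symbol]
            for s in range(num_states)
        ]
    return sum(alpha)


def classify(observations, models):
    """Return the label of the model that assigns the sequence the highest likelihood."""
    return max(models, key=lambda label: sequence_likelihood(observations, *models[label]))


# Hypothetical two-state models over two quantised symbols (0 and 1).
transition = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "expression_a": ([0.5, 0.5], transition, [[0.2, 0.8], [0.1, 0.9]]),
    "expression_b": ([0.5, 0.5], transition, [[0.8, 0.2], [0.9, 0.1]]),
}
predicted = classify([1, 1, 0, 1], models)
```

For long sequences the unscaled forward recursion underflows, so a practical version would work with scaled or log probabilities.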