Kinship verification from facial appearance is a difficult problem. This paper explores the possibility of employing facial expression dynamics in this problem. By using features that describe facial dynamics and spatio-temporal appearance over smile expressions, we show that it is possible to improve the state of the art in this problem, and verify that it is indeed possible to recognize kinship by resemblance of facial expressions. The proposed method is tested on different kin relationships. On average, 72.89% verification accuracy is achieved on spontaneous smiles.
In this work, we propose a dynamic texture-based approach to the recognition of facial Action Units (AUs, atomic facial gestures) and their temporal models (i.e., sequences of temporal segments: neutral, onset, apex, and offset) in near-frontal-view face videos. Two approaches to modeling the dynamics and the appearance in the face region of an input video are compared: an extended version of Motion History Images and a novel method based on Nonrigid Registration using Free-Form Deformations (FFDs). The extracted motion representation is used to derive motion orientation histogram descriptors in both the spatial and temporal domain. Per AU, a combination of discriminative, frame-based GentleBoost ensemble learners and dynamic, generative Hidden Markov Models detects the presence of the AU in question and its temporal segments in an input image sequence. When tested for recognition of all 27 lower and upper face AUs, occurring alone or in combination in 264 sequences from the MMI facial expression database, the proposed method achieved an average event recognition accuracy of 89.2 percent for the MHI method and 94.3 percent for the FFD method. The generalization performance of the FFD method has been tested using the Cohn-Kanade database. Finally, we also explored the performance on spontaneous expressions in the Sensitive Artificial Listener data set.
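The Motion History Image idea mentioned above can be illustrated with a minimal sketch. It shows only the standard MHI update rule (recently moving pixels are set to a maximum value and older motion decays), not the extended version the abstract refers to. The function name `update_mhi`, the frame-differencing motion test, and the parameters `tau` and `threshold` are assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def update_mhi(mhi: Image, prev_frame: Image, frame: Image,
               tau: float, threshold: float) -> Image:
    """Standard MHI update: moving pixels get tau, others decay by 1 toward 0."""
    out = []
    for h_row, p_row, f_row in zip(mhi, prev_frame, frame):
        out.append([
            # Motion is detected here by simple frame differencing (an assumption).
            tau if abs(f - p) > threshold else max(0, h - 1)
            for h, p, f in zip(h_row, p_row, f_row)
        ])
    return out
```

Calling `update_mhi` once per new frame yields an image whose brighter pixels moved more recently, which is the kind of motion representation from which orientation histograms can then be derived.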
In this work we report on the progress of building a system that enables fully automated fast and robust facial expression recognition from face video. We analyse subtle changes in facial expression by recognizing facial muscle action units (AUs) and analysing their temporal behavior. By detecting AUs from face video we enable the analysis of various facial communicative signals including facial expressions of emotion, attitude and mood. For an input video picturing a facial expression we detect per frame whether any of 15 different AUs is activated, whether that facial action is in the onset, apex, or offset phase, and what the total duration of the activation in question is. We base this process upon a set of spatio-temporal features calculated from tracking data for 20 facial fiducial points. To detect these 20 points of interest in the first frame of an input face video, we utilize a fully automatic facial point localization method that uses individual feature GentleBoost templates built from Gabor wavelet features. Then, we exploit a particle filtering scheme that uses factorized likelihoods and a novel observation model that combines a rigid and a morphological model to track the facial points. Finally, the AUs displayed in the input video and their temporal segments are recognized by Support Vector Machines trained on a subset of the most informative spatio-temporal features selected by AdaBoost. On the Cohn-Kanade and MMI databases, the proposed system classifies 15 AUs occurring alone or in combination with other AUs with a mean agreement rate of 90.2% with human FACS coders.
In this paper, we present a new idea to analyze facial expression by exploring some common and specific information among different expressions. Inspired by the observation that only a few facial parts are active in expression disclosure (e.g., around the mouth and eyes), we try to discover the common and specific patches which are important to discriminate all the expressions and only a particular expression, respectively. A two-stage multi-task sparse learning (MTSL) framework is proposed to efficiently locate those discriminative patches. In the first stage of MTSL, expression recognition tasks, each of which aims to find dominant patches for one expression, are combined to locate common patches. In the second stage, two related tasks, facial expression recognition and face verification, are coupled to learn specific facial patches for each individual expression. Extensive experiments validate the existence and significance of common and specific patches. Utilizing these learned patches, we achieve superior performance on expression recognition compared to the state of the art.
Automatic facial action unit (AU) detection from video is a long-standing problem in computer vision. Two main approaches have been pursued: (1) static modeling — typically posed as a discriminative classification problem in which each video frame is evaluated independently; (2) temporal modeling — frames are segmented into sequences and typically modeled with a variant of dynamic Bayesian networks. We propose a segment-based approach, kSeg-SVM, that incorporates benefits of both approaches and avoids their limitations. kSeg-SVM is a temporal extension of the spatial bag-of-words. kSeg-SVM is trained within a structured output SVM framework that formulates AU detection as a problem of detecting temporal events in a time series of visual features. Each segment is modeled by a variant of the BoW representation with soft assignment of the words based on similarity. Our framework has several benefits for AU detection: (1) both dependencies between features and the length of action units are modeled; (2) all possible segments of the video may be used for training; and (3) no assumptions are required about the underlying structure of the action unit events (e.g., i.i.d.). Our algorithm finds the best k-or-fewer segments that maximize the SVM score. Experimental results suggest that the proposed method outperforms state-of-the-art static methods for AU detection.
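The "best k-or-fewer segments" search described above can be sketched with dynamic programming. This is a simplified illustration, not kSeg-SVM itself: it assumes the segment score is just the sum of hypothetical per-frame scores, whereas the abstract describes a structured output SVM over bag-of-words segment features. The name `best_k_segments` is an assumption.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end inclusive


def best_k_segments(scores: List[float], k: int) -> Tuple[float, List[Segment]]:
    """Pick at most k disjoint contiguous segments maximizing the summed score."""
    neg_inf = float("-inf")
    # best[j]: (total, segments) using at most j segments in the frames seen so far.
    # open_[j]: same, but the last segment must end at the most recent frame.
    best = [(0.0, ())] * (k + 1)
    open_ = [(neg_inf, ())] * (k + 1)
    for i, s in enumerate(scores):
        new_best = [best[0]]
        new_open = [open_[0]]
        for j in range(1, k + 1):
            ext_total, ext_segs = open_[j]        # extend the segment ending at i-1
            start_total, start_segs = best[j - 1]  # or start a new segment at i
            if ext_total >= start_total:
                segs = ext_segs[:-1] + ((ext_segs[-1][0], i),)
                cur = (ext_total + s, segs)
            else:
                cur = (start_total + s, start_segs + ((i, i),))
            new_open.append(cur)
            # Keep the earlier solution unless ending a segment at i is better.
            new_best.append(cur if cur[0] > best[j][0] else best[j])
        best, open_ = new_best, new_open
    total, segs = best[k]
    return total, list(segs)
```

For example, with frame scores `[-1, 2, 3, -4, 5, -1, 2]` and `k = 2`, the sketch selects frames 1 to 2 and 4 to 6; if every score is negative it returns no segments at all, which matches the "k-or-fewer" reading.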
FACS (Facial Action Coding System) coding is the state of the art in manual measurement of facial actions. FACS coding, however, is labor intensive and difficult to standardize. A goal of automated FACS coding is to eliminate the need for manual coding and realize automatic recognition and analysis of facial actions. Success of this effort depends in part on access to reliably coded corpora; however, manual FACS coding remains expensive and slow. This paper proposes Fast-FACS, a computer vision aided system that improves speed and reliability of FACS coding. The system has three main novelties: (1) to the best of our knowledge, this is the first paper to predict onsets and offsets from peaks; (2) it uses Active Appearance Models for computer-assisted FACS coding; (3) it learns an optimal metric to predict onsets and offsets from peaks. The system was tested on the RU-FACS database, which consists of natural facial behavior during a two-person interview. Fast-FACS reduced manual coding time by nearly 50% and demonstrated strong concurrent validity with manual FACS coding.
We propose a semi-supervised approach to solve the task of emotion recognition in 2D face images using recent ideas in deep learning for handling the factors of variation present in data. An emotion classification algorithm should be robust to both (1) remaining variations due to the pose of the face in the image after centering and alignment, and (2) the identity or morphology of the face. In order to achieve this invariance, we propose to learn a hierarchy of features in which we gradually filter the factors of variation arising from both (1) and (2). We address (1) by using a multi-scale contractive convolutional network (CCNET) in order to obtain invariance to translations of the facial traits in the image. Using the feature representation produced by the CCNET, we train a Contractive Discriminative Analysis (CDA) feature extractor, a novel variant of the Contractive Auto-Encoder (CAE), designed to learn a representation separating out the emotion-related factors from the others (which mostly capture the subject identity, and what is left of pose after the CCNET). This system beats the state of the art on a recently proposed dataset for facial expression recognition, the Toronto Face Database, moving the state-of-the-art accuracy from 82.4% to 85.0%, while the CCNET and CDA improve the accuracy of a standard CAE by 8%.
Recently, deep neural networks have been shown to perform competitively on the task of predicting facial expression from images. Trained by gradient-based methods, these networks are amenable to "multi-task" learning via a multiple term objective. In this paper we demonstrate that learning representations to predict the position and shape of facial landmarks can improve expression recognition from images. We show competitive results on two large-scale datasets, the ICML 2013 Facial Expression Recognition challenge, and the Toronto Face Database.
A training process for facial expression recognition is usually performed sequentially in three individual stages: feature learning, feature selection, and classifier construction. Extensive empirical studies are needed to search for an optimal combination of feature representation, feature set, and classifier to achieve good recognition performance. This paper presents a novel Boosted Deep Belief Network (BDBN) for performing the three training stages iteratively in a unified loopy framework. Through the proposed BDBN framework, a set of features, which is effective to characterize expression-related facial appearance/shape changes, can be learned and selected to form a boosted strong classifier in a statistical way. As learning continues, the strong classifier is improved iteratively and more importantly, the discriminative capabilities of selected features are strengthened as well according to their relative importance to the strong classifier via a joint fine-tune process in the BDBN framework. Extensive experiments on two public databases showed that the BDBN framework yielded dramatic improvements in facial expression analysis.
In this paper, multimodal learning for facial expression recognition (FER) is proposed. The multimodal learning method makes the first attempt to learn the joint representation by considering the texture and landmark modalities of facial images, which are complementary to each other. In order to learn the representation of each modality and the correlation and interaction between different modalities, structured regularization (SR) is employed to enforce and learn the modality-specific sparsity and density of each modality, respectively. By introducing SR, the comprehensiveness of the facial expression is fully taken into consideration, so the method can not only handle subtle expressions but also perform robustly across different facial image inputs. With the proposed multimodal learning network, the joint representation learned from multimodal inputs will be more suitable for FER. Experimental results on the CK+ and NVIE databases demonstrate the superiority of our proposed method.
In this paper, we propose a new scheme to formulate the dynamic facial expression recognition problem as a longitudinal atlases construction and deformable groupwise image registration problem. The main contributions of this method include: 1) We model human facial feature changes during the facial expression process by a diffeomorphic image registration framework; 2) The subject-specific longitudinal change information of each facial expression is captured by building an expression growth model; 3) Longitudinal atlases of each facial expression are constructed by performing groupwise registration among all the corresponding expression image sequences of different subjects. The constructed atlases can reflect overall facial feature changes of each expression among the population, and can suppress the bias due to inter-personal variations. The proposed method was extensively evaluated on the Cohn-Kanade, MMI, and Oulu-CASIA VIS dynamic facial expression databases and was compared with several state-of-the-art facial expression recognition approaches. Experimental results demonstrate that our method consistently achieves the highest recognition accuracies among other methods under comparison on all the databases.
We present a novel framework for the recognition of facial expressions at arbitrary poses that is based on 2D geometric features. We address the problem by first mapping the 2D locations of landmark points of facial expressions in non-frontal poses to the corresponding locations in the frontal pose. Then, recognition of the expressions is performed by using any state-of-the-art facial expression recognition method (in our case, multi-class SVM). To learn the mappings that achieve pose normalization, we use a novel Gaussian Process Regression (GPR) model which we name the Coupled Gaussian Process Regression (CGPR) model. Instead of learning a single GPR model for all target pairs of poses at once, or learning one GPR model per target pair of poses independently of other pairs of poses, we propose the CGPR model, which also models the couplings between the GPR models learned independently per target pair of poses. To the best of our knowledge, the proposed method is the first one satisfying all of the following: (i) being face-shape-model-free, (ii) handling expressive faces in the range from -45° to +45° pan rotation and from -30° to +30° tilt rotation, and (iii) performing accurately for continuous head pose despite the fact that the training was conducted only on a set of discrete poses.
In the last few years, Facial Expression Synthesis (FES) has been a flourishing area of research driven by applications in character animation, computer games, and human computer interaction. This paper proposes a photo-realistic FES method based on Bilinear Kernel Reduced Rank Regression (BKRRR). BKRRR learns a high-dimensional mapping between the appearance of a neutral face and a variety of expressions (e.g. smile, surprise, squint). There are two main contributions in this paper: (1) we propose BKRRR for FES and evaluate several algorithms for learning its parameters; (2) we propose a new method to preserve subtle person-specific facial characteristics (e.g. wrinkles, pimples). Experimental results on the CMU Multi-PIE database and pictures taken with a regular camera show the effectiveness of our approach.
We consider the task of labeling facial emotion intensities in videos, where the emotion intensities to be predicted have ordinal scales (e.g., low, medium, and high) that change in time. A significant challenge is that the rates of increase and decrease differ substantially across subjects. Moreover, the actual absolute differences of intensity values carry little information, with their relative order being more important. To solve the intensity prediction problem we propose a new dynamic ranking model that models the signal intensity at each time as a label on an ordinal scale and links the temporally proximal labels using dynamic smoothness constraints. This new model extends the successful static ordinal regression to a structured (dynamic) setting by using an analogy with Conditional Random Field (CRF) models in structured classification. We show that, although non-convex, the new model can be accurately learned using efficient gradient search. The predictions resulting from this dynamic ranking model show significant improvements over the regular CRFs, which fail to consider ordinal relationships between predicted labels. We also observe substantial improvements over static ranking models that do not exploit temporal dependencies of ordinal predictions. We demonstrate the benefits of our algorithm on the Cohn-Kanade dataset for the dynamic facial emotion intensity prediction problem and illustrate its performance in a controlled synthetic setting.
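The static ordinal regression that the dynamic model above builds on can be sketched as a cumulative-link model. This is a generic ordered-logit illustration, not the paper's model: the logistic link, the function name `ordinal_probabilities`, and the threshold values are assumptions, and the temporal smoothness constraints are not included.

```python
import math
from typing import List, Sequence


def ordinal_probabilities(score: float, thresholds: Sequence[float]) -> List[float]:
    """P(label = r | score) for r = 0..len(thresholds) under an ordered-logit model.

    `thresholds` must be increasing; they cut the real line into ordered bins.
    """
    def sigmoid(z: float) -> float:
        return 1.0 / (1.0 + math.exp(-z))

    # Cumulative probabilities P(label <= r), padded with 0 and 1 at the ends.
    cdf = [0.0] + [sigmoid(b - score) for b in thresholds] + [1.0]
    return [cdf[r + 1] - cdf[r] for r in range(len(cdf) - 1)]
```

Because the labels share a single latent score, only their relative order matters: raising the score shifts probability mass monotonically from "low" toward "high", which is the property that a plain multi-class classifier does not encode.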
Emotion recognition from facial images is a very active research topic in human computer interaction (HCI). However, most of the previous approaches only focus on the frontal or nearly frontal view facial images. In contrast to the frontal/nearly-frontal view images, emotion recognition from non-frontal view or even arbitrary view facial images is much more difficult yet of more practical utility. To handle the emotion recognition problem from arbitrary view facial images, in this paper we propose a novel method based on the regional covariance matrix (RCM) representation of facial images. We also develop a new discriminant analysis theory, aiming at reducing the dimensionality of the facial feature vectors while preserving the most discriminative information, by minimizing an estimated multiclass Bayes error derived under the Gaussian mixture model (GMM). We further propose an efficient algorithm to solve the optimal discriminant vectors of the proposed discriminant analysis method. We render thousands of multi-view 2D facial images from the BU-3DFE database and conduct extensive experiments on the generated database to demonstrate the effectiveness of the proposed method. It is worth noting that our method does not require face alignment or facial landmark points localization, making it very attractive.
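A region covariance descriptor of the kind mentioned above can be sketched as the covariance of per-pixel feature vectors over an image region. The feature set used here (pixel coordinates, intensity, and absolute horizontal and vertical differences) is a common choice assumed for illustration; the abstract does not state which features the authors use, and `region_covariance` is a hypothetical name.

```python
from typing import List


def region_covariance(image: List[List[float]]) -> List[List[float]]:
    """Covariance of per-pixel features [x, y, I, |Ix|, |Iy|] over the region."""
    h, w = len(image), len(image[0])
    feats = []
    for y in range(h):
        for x in range(w):
            # Halved differences between neighbours, clamped at the border.
            ix = (image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]) / 2.0
            iy = (image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]) / 2.0
            feats.append([float(x), float(y), float(image[y][x]), abs(ix), abs(iy)])
    n, d = len(feats), len(feats[0])
    mean = [sum(f[k] for f in feats) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in feats:
        for a in range(d):
            for b in range(d):
                cov[a][b] += (f[a] - mean[a]) * (f[b] - mean[b])
    # Unbiased estimate; requires a region with at least two pixels.
    return [[v / (n - 1) for v in row] for row in cov]
```

The result is a small symmetric matrix whose size depends only on the number of features, not on the region size, which is one reason such descriptors do not need landmark localization.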
Facial micro-expressions are rapid involuntary facial expressions which reveal suppressed affect. To the best of the authors' knowledge, there is no previous work that successfully recognises spontaneous facial micro-expressions. In this paper we show how a temporal interpolation model together with the first comprehensive spontaneous micro-expression corpus enables us to accurately recognise these very short expressions. We designed an induced emotion suppression experiment to collect the new corpus using a high-speed camera. The system is the first to recognise spontaneous facial micro-expressions and achieves very promising results that compare favourably with the human micro-expression detection accuracy.
Spatial-temporal relations among facial muscles carry crucial information about facial expressions yet have not been thoroughly exploited. One contributing factor for this is the limited ability of the current dynamic models in capturing complex spatial and temporal relations. Existing dynamic models can only capture simple local temporal relations among sequential events, or lack the ability for incorporating uncertainties. To overcome these limitations and take full advantage of the spatio-temporal information, we propose to model the facial expression as a complex activity that consists of temporally overlapping or sequential primitive facial events. We further propose the Interval Temporal Bayesian Network to capture these complex temporal relations among primitive facial events for facial expression modeling and recognition. Experimental results on benchmark databases demonstrate the feasibility of the proposed approach in recognizing facial expressions based purely on spatio-temporal relations among facial muscles, as well as its advantage over the existing methods.
We address the problem of editing facial expression in video, such as exaggerating, attenuating or replacing the expression with a different one in some parts of the video. To achieve this we develop a tensor-based 3D face geometry reconstruction method, which fits a 3D model for each video frame, with the constraint that all models have the same identity and requiring temporal continuity of pose and expression. With the identity constraint, the differences between the underlying 3D shapes capture only changes in expression and pose. We show that various expression editing tasks in video can be achieved by combining face reordering with face warping, where the warp is induced by projecting differences in 3D face shapes into the image plane. Analogously, we show how the identity can be manipulated while fixing expression and pose. Experimental results show that our method can effectively edit expressions and identity in video in a temporally-coherent way with high fidelity.
This paper presents a method to synthesize a realistic facial animation of a target person, driven by a facial performance video of another person. Different from traditional facial animation approaches, our system takes advantage of an existing facial performance database of the target person, and generates the final video by retrieving frames from the database that have similar expressions to the input ones. To achieve this we develop an expression similarity metric for accurately measuring the expression difference between two video frames. To enforce temporal coherence, our system employs a shortest path algorithm to choose the optimal image for each frame from a set of candidate frames determined by the similarity metric. Finally, our system adopts an expression mapping method to further minimize the expression difference between the input and retrieved frames. Experimental results show that our system can generate high quality facial animation using the proposed data-driven approach.
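The shortest-path frame selection described above can be sketched as a Viterbi-style dynamic program over per-frame candidate sets. This is an illustration under assumptions: `match_cost` stands in for the expression similarity metric, `transition_cost` for whatever temporal-coherence penalty the system uses, and `select_frames` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def select_frames(
    match_cost: Sequence[Sequence[float]],
    candidates: Sequence[Sequence[int]],
    transition_cost: Callable[[int, int], float],
) -> List[int]:
    """Return one database frame id per input frame, minimizing total cost.

    candidates[t][c] is the id of the c-th candidate for input frame t and
    match_cost[t][c] is its expression dissimilarity to that input frame.
    """
    if not candidates:
        return []
    # cost[c]: cheapest path cost ending at candidate c of the current frame.
    cost = list(match_cost[0])
    back: List[List[int]] = []
    for t in range(1, len(candidates)):
        new_cost, pointers = [], []
        for c, frame_id in enumerate(candidates[t]):
            # Cheapest predecessor among the previous frame's candidates.
            prev = min(
                range(len(candidates[t - 1])),
                key=lambda p: cost[p] + transition_cost(candidates[t - 1][p], frame_id),
            )
            step = transition_cost(candidates[t - 1][prev], frame_id)
            new_cost.append(cost[prev] + step + match_cost[t][c])
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the best final candidate.
    idx = min(range(len(cost)), key=cost.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]
```

Picking the best match independently per frame can jump between distant parts of the database; the path formulation trades a slightly worse per-frame match for a smoother sequence.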
Large margin learning approaches, such as support vector machines (SVM), have been successfully applied to numerous classification tasks, especially for automatic facial expression recognition. The risk of such approaches, however, is their sensitivity to large margin losses due to the influence of noisy training examples and outliers, which is a common problem in the area of affective computing (i.e., manual coding at the frame level is tedious, so coarse labels are normally assigned). In this paper, we leverage the relaxation of the parallel-hyperplanes constraint and propose the use of modified correlation filters (MCF). The MCF is similar in spirit to SVMs and correlation filters, but with the key difference of optimizing only a single hyperplane. We demonstrate the superiority of MCF over current techniques on a battery of experiments.
Automated facial expression recognition has received increased attention over the past two decades. Existing works in the field usually do not encode either the temporal evolution or the intensity of the observed facial displays. They also fail to jointly model multidimensional (multi-class) continuous facial behaviour data; binary classifiers - one for each target basic-emotion class - are used instead. In this paper, intrinsic topology of multidimensional continuous facial affect data is first modeled by an ordinal manifold. This topology is then incorporated into the Hidden Conditional Ordinal Random Field (H-CORF) framework for dynamic ordinal regression by constraining H-CORF parameters to lie on the ordinal manifold. The resulting model attains simultaneous dynamic recognition and intensity estimation of facial expressions of multiple emotions. To the best of our knowledge, the proposed method is the first one to achieve this on both deliberate as well as spontaneous facial affect data.