TEMPORAL ANALYSIS OF FACIAL EXPRESSIONS IN THE WILD USING HIGH LEVEL FACIAL ATTRIBUTES AND DEEP LEARNING ARCHITECTURE



Project Summary



This project aims to develop new methods for the temporal analysis of the facial expressions of one or more people in video recorded in the wild, using high-level attributes and a deep learning architecture. After the faces in the video frames are detected, each person's facial landmarks and face tube will be extracted, and a deep-learning-based pose normalization will be applied, followed by a facial representation based on binary and relative attributes. Finally, expression localization and classification will be handled by a multi-modal and multi-scale deep learning architecture.
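To make the intended pipeline concrete, the following Python skeleton shows how these stages are meant to fit together. Every function body is a hypothetical placeholder standing in for the corresponding project component, not an actual implementation:

```python
import numpy as np

def detect_face(frame):
    # Placeholder detector: returns one dummy 64x64 face crop.
    return frame[:64, :64]

def pose_normalize(face):
    # Placeholder for the auto-encoder based normalization (contribution 1).
    return face

def attribute_vector(face):
    # Placeholder for the attribute-based representation (contribution 3).
    return np.zeros(16)

def classify_sequence(attribute_seq):
    # Placeholder for the temporal deep model (contribution 4).
    return ["neutral"] * len(attribute_seq)

def analyze_video(frames):
    # A face tube is the per-person sequence of face crops over time.
    tube = [detect_face(f) for f in frames]
    normalized = [pose_normalize(face) for face in tube]
    attributes = [attribute_vector(face) for face in normalized]
    return classify_sequence(attributes)

frames = [np.random.rand(128, 128) for _ in range(10)]
print(analyze_video(frames))  # one expression label per frame
```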

There are many studies on detecting and analyzing facial expressions recorded in a laboratory environment, where subjects are filmed frontally, close to the camera, performing a single exaggerated expression. However, for subjects recorded in the wild, at varying distances and with more than one face in view, face detection and expression analysis are still far less successful than in the laboratory setting. The "Emotion Recognition in the Wild" challenge, which analyzes short video clips containing a single facial expression, illustrates this gap: accuracy remains around 40% even when the video clips are supported with audio input.

The project makes four contributions to the literature:

1. A deep learning architecture is proposed for eliminating pose differences during face normalization, one of the most important factors in the performance of this problem. The proposed method treats 2-dimensional pose normalization as a nonlinear regression problem and builds an artificial neural network resembling a patch-based multi-layered auto-encoder that learns, for a posed face, the corresponding frontal face (see the first sketch after this list).

2. A subject-specific neutral face is estimated using a learning-based approach combined with temporal analysis of facial action units. In the learning-based approach, a model is trained on neutral faces collected from many different people to learn a generic "neutral" class. When an unknown facial expression sequence is given, the pre-learned generic neutral face is used to seed a median filter along the temporal dimension. The resulting subject-specific neutral face also makes it possible to detect subtle facial motion (see the second sketch after this list).

3. High-level facial attributes are proposed for representing facial expressions. Attribute-based representation, which has proven successful in object recognition problems in recent years, is used for the first time in this project to classify facial expressions. In the first step, facial action units are taken as attributes; high-level facial attributes are then derived using supervised techniques, building on our previous study in which a random-attributes approach was developed (see the third sketch after this list).

4. In order to analyze facial expressions in image sequences, a multi-layer and multi-scale deep neural network combined with a multi-layer perceptron is proposed for the temporal domain. The input layer of the proposed architecture combines a Motion History Image summarizing neighboring frames, the facial landmarks, and the upper- and lower-face image patches. The output layer produces the temporal series of expression labels (see the final sketch after this list).
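For contribution 1, the following is a minimal sketch of a patch-based auto-encoder used as a nonlinear regressor from a posed patch to its frontal counterpart. The patch size, layer widths, and code dimension are illustrative assumptions, not the project's actual settings:

```python
import torch
import torch.nn as nn

# Sketch only: regress the frontal appearance of a 16x16 face patch
# from its posed appearance with an encoder-decoder (auto-encoder-like)
# network. All dimensions are illustrative assumptions.
class PatchFrontalizer(nn.Module):
    def __init__(self, patch_dim=16 * 16, hidden=256, code=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code, hidden), nn.ReLU(),
            nn.Linear(hidden, patch_dim),
        )

    def forward(self, posed_patch):
        # Nonlinear regression: posed patch -> frontal patch.
        return self.decoder(self.encoder(posed_patch))

model = PatchFrontalizer()
posed = torch.rand(32, 16 * 16)           # batch of posed patches
frontal_target = torch.rand(32, 16 * 16)  # corresponding frontal patches
loss = nn.functional.mse_loss(model(posed), frontal_target)
loss.backward()                           # gradients for one training step
```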
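For contribution 2, a sketch under two illustrative assumptions: the generic neutral face is an average over many subjects, and it anchors a per-pixel temporal median over the unknown sequence. Both are one possible reading of the description above, not the project's exact procedure:

```python
import numpy as np

def generic_neutral(neutral_faces):
    # neutral_faces: (num_subjects, H, W) neutral faces from many people;
    # their mean stands in for the learned generic "neutral" class.
    return np.mean(neutral_faces, axis=0)

def subject_neutral(sequence, generic):
    # Per-pixel median over time, with the generic neutral prepended so
    # the estimate is anchored toward "neutral" even if the clip never
    # shows a fully relaxed face.
    stack = np.concatenate([generic[None], sequence], axis=0)
    return np.median(stack, axis=0)

def motion_magnitude(sequence, neutral):
    # Per-frame deviation from the subject-specific neutral face;
    # small but consistent deviations expose subtle facial motion.
    return np.abs(sequence - neutral[None]).mean(axis=(1, 2))

train = np.random.rand(100, 64, 64)  # neutral faces from many subjects
clip = np.random.rand(30, 64, 64)    # unknown expression sequence
neutral = subject_neutral(clip, generic_neutral(train))
print(motion_magnitude(clip, neutral))
```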
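For contribution 3, a sketch of the attribute-based representation with facial action units (AUs) as attributes. The AU detector outputs are random stand-ins, and the nearest-centroid classifier is only an illustration of how attribute vectors could be used, not the project's actual classifier:

```python
import numpy as np

NUM_AUS = 12  # number of action-unit attributes (illustrative)
rng = np.random.default_rng(0)

def au_attribute_vector(face):
    # Stand-in for per-AU detector scores in [0, 1]; a real system
    # would run one trained detector per action unit on the face.
    return rng.random(NUM_AUS)

def train_centroids(vectors, labels):
    # Mean attribute vector per expression class.
    return {c: vectors[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(vec, centroids):
    # Nearest-centroid decision in attribute space.
    return min(centroids, key=lambda c: np.linalg.norm(vec - centroids[c]))

faces = [np.random.rand(64, 64) for _ in range(60)]
labels = np.array(["happy", "sad", "neutral"] * 20)
vectors = np.stack([au_attribute_vector(f) for f in faces])
centroids = train_centroids(vectors, labels)
print(classify(au_attribute_vector(faces[0]), centroids))
```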
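Finally, for contribution 4, a sketch of the Motion History Image that the input layer is described as using, in the classic Bobick-Davis formulation: pixels that move are refreshed to tau, and all others decay by one per frame. The decay constant tau and the motion threshold are illustrative values:

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=0.1):
    # frames: (T, H, W) grayscale sequence with values in [0, 1].
    mhi = np.zeros(frames.shape[1:], dtype=float)
    for prev, cur in zip(frames[:-1], frames[1:]):
        moving = np.abs(cur - prev) > threshold
        # Moving pixels are refreshed to tau; static pixels decay.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalized: recent motion appears brightest

clip = np.random.rand(20, 64, 64)        # stand-in for a face tube
print(motion_history_image(clip).shape)  # (64, 64) single-channel input
```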

The software implemented within the scope of this project will enable offline analysis of facial expressions in in-the-wild video recordings, especially surveillance footage. It can be used in areas that rely on big data analytics and business intelligence applications, such as customer satisfaction and employee performance analysis.