It is common to use domain specific terminology – attributes – to describe the visual appearance of objects. In order to scale the use of these describable visual attributes to a large number of categories, especially those not well studied by psychologists or linguists, it will be necessary to find alternative techniques for identifying attribute vocabularies and for learning to recognize attributes without hand labeled training data. We demonstrate that it is possible to accomplish both these tasks automatically by mining text and image data sampled from the Internet. The proposed approach also characterizes attributes according to their visual representation: global or local, and type: color, texture, or shape. This work focuses on discovering attributes and their visual appearance, and is as agnostic as possible about the textual description.
We present an interactive, hybrid human-computer method for object classification. The method applies to classes of objects that are recognizable by people with appropriate expertise (e.g., animal species or airplane model), but not (in general) by people without such expertise. It can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate our methods on Birds-200, a difficult dataset of 200 tightly-related bird species, and on the Animals With Attributes dataset. Our results demonstrate that incorporating user input drives up recognition accuracy to levels that are good enough for practical applications, while at the same time, computer vision reduces the amount of human interaction required.
We propose an approach to find and describe objects within broad domains. We introduce a new dataset that provides annotation for sharing models of appearance and correlation across categories. We use it to learn part and category detectors. These serve as the visual basis for an integrated model of objects. We describe objects by the spatial arrangement of their attributes and the interactions between them. Using this model, our system can find animals and vehicles that it has not seen and infer attributes, such as function and pose. Our experiments demonstrate that we can more reliably locate and describe both familiar and unfamiliar objects, compared to a baseline that relies purely on basic category detectors.
We investigate the task of learning models for visual object recognition from natural language descriptions alone. The approach contributes to the recognition of fine-grain object categories, such as animal and plant species, where it may be difficult to collect many images for training, but where textual descriptions of visual attributes are readily available. As an example we tackle recognition of butterfly species, learning models from descriptions in an online nature guide. We propose natural language processing methods for extracting salient visual attributes from these descriptions to use as ‘templates’ for the object categories, and apply vision methods to extract corresponding attributes from test images. A generative model is used to connect textual terms in the learnt templates to visual attributes. We report experiments comparing the performance of humans and the proposed method on a dataset of ten butterfly categories.
Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.
One of the main challenges in learning fine-grained visual categories is gathering training images. Recent work in Zero-Shot Learning (ZSL) circumvents this challenge by describing categories via attributes or text. However, not all visual concepts, e.g., two people dancing, are easily amenable to such descriptions. In this paper, we propose a new modality for ZSL using visual abstraction to learn difficult-to-describe concepts. Specifically, we explore concepts related to people and their interactions with others. Our proposed modality allows one to provide training data by manipulating abstract visualizations, e.g., one can illustrate interactions between two clipart people by manipulating each person’s pose, expression, gaze, and gender. The feasibility of our approach is shown on a human pose dataset and a new dataset containing complex interactions between two people, where we outperform several baselines. To better match across the two domains, we learn an explicit mapping between the abstract and real worlds.
Visual content analysis has always been important yet challenging. Thanks to the popularity of social networks, images become an convenient carrier for information diffusion among online users. To understand the diffusion patterns and different aspects of the social images, we need to interpret the images first. Similar to textual content, images also carry different levels of sentiment to their viewers. However, different from text, where sentiment analysis can use easily accessible semantic and context information, how to extract and interpret the sentiment of an image remains quite challenging. In this paper, we propose an image sentiment prediction framework, which leverages the mid-level attributes of an image to predict its sentiment. This makes the sentiment classification results more interpretable than directly using the low-level features of an image. To obtain a better performance on images containing faces, we introduce eigenface-based facial expression detection as an additional mid-level attributes. An empirical study of the proposed framework shows improved performance in terms of prediction accuracy. More importantly, by inspecting the prediction results, we are able to discover interesting relationships between mid-level attribute and image sentiment.
In many problems of machine learning and computer vision, there exists side information, i.e., information contained in the training data and not available in the testing phase. This motivates the recent development of a new learning approach known as learning with side information that aims to incorporate side information for improved learning algorithms. In this work, we describe a new training method of boosting classifiers that uses side information, which we term as AdaBoost+. In particular, AdaBoost+ employs a novel classification label imputation method to construct extra weak classifiers from the available information that simulate the performance of better weak classifiers obtained from the features in side information. We apply our method to two problems, namely handwritten digit recognition and facial expression recognition from low resolution images, where it demonstrates its effectiveness in classification performance.
Relevant and irrelevant web images collected by tag-based image retrieval have been employed as loosely labeled training data for learning SVM classifiers for image categorization by only using the visual features. In this work, we propose a new image categorization method by incorporating the textual features extracted from the surrounding textual descriptions (tags, captions, categories, etc.) as privileged information and simultaneously coping with noise in the loose labels of training web images. When the training and test samples come from different datasets, our proposed method can be further extended to reduce the data distribution mismatch by adding a regularizer based on the Maximum Mean Discrepancy (MMD) criterion. Our comprehensive experiments on three benchmark datasets demonstrate the effectiveness of our proposed methods for image categorization and image retrieval by exploiting privileged information from web data.
The computer vision community has reached a point when it can start considering high-level reasoning tasks such as the “communicative intents” of images, or in what light an image portrays its subject. For example, an image might imply that a politician is competent, trustworthy, or energetic. We explore a variety of features for predicting these communicative intents. We study a number of facial expressions and body poses as cues for the implied nuances of the politician’s personality. We also examine how the setting of an image (e.g. kitchen or hospital) influences the audience’s perception of the portrayed politician. Finally, we improve the performance of an existing approach on this problem, by learning intermediate cues using convolutional neural networks. We show state of the art results on the Visual Persuasion dataset of Joo et al.
As humans, we love to rank things. Top ten lists exist for everything from movie stars to scary animals. Ambiguities (i.e. ties) naturally occur in the process of ranking when people feel they cannot distinguish two items. Human reported rankings derived from star ratings abound on recommendation websites such as Yelp and Netflix. However, those websites differ in star precision which points to the need for ranking systems that adapt to an individual users preference sensitivity. In this work we propose an adaptive system that allows for ties when collecting ranking data. Using this system, we propose a framework for obtaining computer-generated rankings. We test our system and a computer-generated ranking method on the problem of evaluating facial aesthetics. Since aesthetics is a personalized and subjective issue, and it is hard to obtain large amount of aesthetics rankings from each user, we extract low-dimensional discriminative features from weakly labelled facial images and apply them to afterward learning. Extensive experimental evaluations and analysis well demonstrate the effectiveness of our work.
We consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update rule of Littlestone and Warmuth  can be adapted to this mode yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games and prediction of points in n . We also show how the weight-update rule can be used to derive a new boosting algorithm which does not require prior knowledge about the performance of the weak learning algorithm.
High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
We show how to use “complementary priors” to eliminate the explaining-away effects thatmake inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of thewake-sleep algorithm. After fine-tuning, a networkwith three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to displaywhat the associativememory has in mind
We present in this paper a novel approach for training deterministic auto-encoders. We show that by adding a well chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized autoencoders as well as denoising auto-encoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized auto-encoders and denoising auto-encoders and how it can be seen as a link between deterministic and non-deterministic auto-encoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize a MLP, we achieve state of the art classification error on a range of datasets, surpassing other methods of pretraining.
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra-unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training.
One conventional tool for interpolating surfaces over scattered data, the thin-plate spline, has an elegant algebra expressing the dependence of the physical bending energy of a thin metal plate on point constraints. For interpolation of a surface over a fixed set of nodes in the plane, the bending energy is a quadratic form in the heights assigned to the surface.
Face recognition in real-world conditions requires the ability to deal with a number of conditions, such as variations in pose, illumination and expression. In this paper, we focus on variations in head pose and use a computationally efficient regression-based approach for synthesising face images in different poses, which are used to extend the face recognition training set.
We consider the problem ofdynamically apportionlng resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting
They describe a new method of matching statistical models of appearance to images. A set of model parameters control modes of shape and gray-level variation learned from a training set. They construct an efficient iterative matching algorithm by learning the relationship between perturbations in the model parameters and the induced image errors.