We propose to shift the goal of recognition from naming to describing. Doing so allows us not only to name familiar objects, but also: to report unusual aspects of a familiar object (“spotty dog”, not just “dog”); to say something about unfamiliar objects (“hairy and four-legged”, not just “unknown”); and to learn how to recognize new objects with few or no visual examples. Rather than focusing on identity assignment, we make inferring attributes the core problem of recognition. These attributes can be semantic (“spotty”) or discriminative (“dogs have it but sheep do not”). Learning attributes presents a major new challenge: generalization across object categories, not just across instances within a category. In this paper, we also introduce a novel feature selection method for learning attributes that generalize well across categories. We support our claims with a thorough evaluation that provides insights into the limitations of the standard recognition paradigm of naming and demonstrates the new abilities provided by our attribute-based framework.
Binary and Relative Features
We build word models for American Sign Language (ASL) that transfer between different signers and different aspects. This is advantageous because one could use large amounts of labelled avatar data in combination with a smaller amount of labelled human data to spot a large number of words in human data. Transfer learning is possible because we represent blocks of video with novel intermediate discriminative features based on splits of the data. By constructing the same splits in avatar and human data and clustering appropriately, our features are both discriminative and semantically similar: across signers similar features imply similar words. We demonstrate transfer learning in two scenarios: from avatar to a frontally viewed human signer and from an avatar to human signer in a 3/4 view.
We propose an approach to find and describe objects within broad domains. We introduce a new dataset that provides annotation for sharing models of appearance and correlation across categories. We use it to learn part and category detectors. These serve as the visual basis for an integrated model of objects. We describe objects by the spatial arrangement of their attributes and the interactions between them. Using this model, our system can find animals and vehicles that it has not seen and infer attributes, such as function and pose. Our experiments demonstrate that we can more reliably locate and describe both familiar and unfamiliar objects, compared to a baseline that relies purely on basic category detectors.
Human-nameable visual “attributes” can benefit various recognition tasks. However, existing techniques restrict these properties to categorical labels (for example, a person is `smiling' or not, a scene is `dry' or not), and thus fail to capture more general semantic relationships. We propose to model relative attributes. Given training data stating how object/scene categories relate according to different attributes, we learn a ranking function per attribute. The learned ranking functions predict the relative strength of each property in novel images. We then build a generative model over the joint space of attribute ranking outputs, and propose a novel form of zero-shot learning in which the supervisor relates the unseen object category to previously seen objects via attributes (for example, `bears are furrier than giraffes'). We further show how the proposed relative attributes enable richer textual descriptions for new images, which in practice are interpreted more precisely by humans. We demonstrate the approach on datasets of faces and natural scenes, and show its clear advantages over traditional binary attribute prediction for these new tasks.
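A per-attribute ranking function of this kind can be learned from pairwise comparisons. The sketch below is a minimal illustration on synthetic data, not the authors' actual formulation: it uses a RankSVM-style reduction with plain subgradient descent, where every statement "image i has more of the attribute than image j" becomes a difference vector that should score positive under the learned direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))       # 40 images, 8-dim features (synthetic)
w_true = rng.normal(size=8)        # hidden "true" attribute direction
s = X @ w_true
# Ordered pairs (i, j): image i has MORE of the attribute than image j.
pairs = [(i, j) for i in range(40) for j in range(40) if s[i] > s[j] + 0.5]

# RankSVM-style reduction: each ordered pair yields one difference
# vector x_i - x_j that should score positive under the learned w.
D = np.array([X[i] - X[j] for i, j in pairs])

w = np.zeros(8)
lr, lam = 0.01, 1e-3
for _ in range(200):               # subgradient descent on the hinge loss
    violating = D[D @ w < 1]       # pairs not yet ranked with margin 1
    w -= lr * (-violating.sum(axis=0) / len(D) + lam * w)

agreement = float(np.mean(D @ w > 0))   # fraction of pairs ranked correctly
print(f"pairwise agreement: {agreement:.2f}")
```

Ranking by the projection onto `w` then gives the relative strength of the attribute in any novel image.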
The current trend in image analysis and multimedia is to use information extracted from text and text processing techniques to help vision-related tasks, such as automated image annotation and generating semantically rich descriptions of images. In this work, we claim that image analysis techniques can "return the favor" to the text processing community and be successfully used for a general-purpose representation of word meaning. We provide evidence that simple low-level visual features can enrich the semantic representation of word meaning with information that cannot be extracted from text alone, leading to improvement in the core task of estimating degrees of semantic relatedness between words, as well as providing a new, perceptually-enhanced angle on word semantics. Additionally, we show how distinguishing between a concept and its context in images can improve the quality of the word meaning representations extracted from images.
The bag-of-visual-words (BoW) model has recently been widely advocated for image classification and search. However, one critical limitation of the existing BoW model is its lack of semantic information. To alleviate this issue, it is imperative to construct a semantics-aware visual dictionary. In this paper, we propose a novel approach for learning a visual word dictionary that embeds intermediate-level semantics. Specifically, we first introduce an Attribute-aware Dictionary Learning (AttrDL) scheme to learn multiple sub-dictionaries with specific semantic meanings. We divide the training images into different sets, each of which represents a specific attribute, and learn an attribute-aware sub-vocabulary for each set. The resulting sub-vocabularies are more semantically discriminative than traditional vocabularies. Second, to obtain a semantics-aware and discriminative BoW representation with the learned sub-vocabularies, we adopt the idea of L21-norm regularized sparse coding and re-encode the resulting sparse representation of each image. Experimental results show that the proposed scheme outperforms state-of-the-art algorithms in both image classification and search tasks.
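The sub-vocabulary idea can be illustrated with a toy hard-assignment encoder, where random sub-dictionaries and descriptors stand in for learned ones and plain vector quantization stands in for the L21-norm regularized sparse coding: each attribute-specific sub-vocabulary quantizes the image's descriptors separately, and the per-attribute histograms are concatenated.

```python
import numpy as np

rng = np.random.default_rng(3)
# Three attribute-specific sub-vocabularies of five 4-dim codewords each
# (random stand-ins for learned sub-dictionaries).
sub_vocabs = [rng.normal(size=(5, 4)) for _ in range(3)]

def encode(descriptors, sub_vocabs):
    """Hard-assignment BoW histogram per sub-vocabulary, concatenated."""
    hists = []
    for vocab in sub_vocabs:
        # squared distance of every descriptor to every codeword
        d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=-1)
        words = d2.argmin(axis=1)                  # nearest codeword per descriptor
        hists.append(np.bincount(words, minlength=len(vocab)))
    return np.concatenate(hists)

desc = rng.normal(size=(50, 4))    # 50 local descriptors from one image
h = encode(desc, sub_vocabs)
print(h.shape, int(h.sum()))       # prints: (15,) 150
```

Each 5-bin block of the 15-dim code carries the semantics of one attribute, which is what makes the concatenated representation more semantically discriminative than a single flat vocabulary.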
Previous studies have shown that attribute based approaches obtained successful results in image classification. However, in human based supervised methods, it gets harder to determine all the attributes, which are associated with image classes, in large datasets. In addition, human beings have difficulties in characterizing discriminative attributes among images. In unsupervised methods, when the number of classes increases, the excessive growth of the search space appears to be a major problem. In this study, we try to solve the problems in supervised and unsupervised methods by random attribute approach. Random attributes can be defined as hypothetical attributes which depict images. They are extracted randomly from the feature space as binary or relative. The proposed approach has been compared to the other attribute based studies in the literature using the same data sets. The highest image classification performances obtained in other studies has been reached in the experiments especially as the number of attributes increase.
We propose a novel mode of feedback for image search, where a user describes which properties of exemplar images should be adjusted in order to more closely match their mental model of the image(s) sought. For example, perusing image results for a query “black shoes”, the user might state, “Show me shoe images like these, but sportier.” Offline, our approach first learns a set of ranking functions, each of which predicts the relative strength of a nameable attribute in an image (`sportiness', `furriness', etc.). At query time, the system presents an initial set of reference images, and the user selects among them to provide relative attribute feedback. Using the resulting constraints in the multi-dimensional attribute space, our method updates its relevance function and re-ranks the pool of images. This procedure iterates using the accumulated constraints until the top ranked images are acceptably close to the user's envisioned target. In this way, our approach allows a user to efficiently “whittle away” irrelevant portions of the visual feature space, using semantic language to precisely communicate their preferences to the system. We demonstrate the technique for refining image search for people, products, and scenes, and show it outperforms traditional binary relevance feedback in terms of search speed and accuracy.
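The re-ranking step can be sketched as follows, with made-up attribute scores standing in for the learned ranking functions and a simple constraint-satisfaction count standing in for the actual relevance function: each piece of feedback ("more sporty than image 7", "less shiny than image 3") prunes one side of an attribute dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
# Precomputed attribute strengths for a pool of 100 images
# (columns: sportiness, shininess; values are illustrative).
A = rng.uniform(size=(100, 2))

# Accumulated relative-attribute feedback: (attribute, reference image, direction);
# +1 means "more of this attribute than the reference", -1 means "less".
constraints = [(0, 7, +1), (1, 3, -1)]

def relevance(A, constraints):
    """Count how many feedback constraints each image satisfies."""
    score = np.zeros(len(A))
    for attr, ref, sign in constraints:
        score += (sign * (A[:, attr] - A[ref, attr]) > 0)
    return score

ranking = np.argsort(-relevance(A, constraints))   # best candidates first
best = ranking[0]
print("top image satisfies",
      int(relevance(A, constraints)[best]), "of", len(constraints), "constraints")
```

Each iteration of feedback appends constraints to the list, so the satisfied-count relevance keeps whittling the pool toward the envisioned target.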
We have created the first image search engine based entirely on faces. Using simple text queries such as “smiling men with blond hair and mustaches,” users can search through over 3.1 million faces which have been automatically labeled on the basis of several facial attributes. Faces in our database have been extracted and aligned from images downloaded from the internet using a commercial face detector, and the number of images and attributes continues to grow daily. Our classification approach uses a novel combination of Support Vector Machines and AdaBoost which exploits the strong structure of faces to select and train on the optimal set of features for each attribute. We show state-of-the-art classification results compared to previous works, and demonstrate the power of our architecture through a functional, large-scale face search engine. Our framework is fully automatic, easy to scale, and computes all labels off-line, leading to fast on-line search performance. In addition, we describe how our system can be used for a number of applications, including law enforcement, social networks, and personal photo management. Our search engine will soon be made publicly available.
We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes, and image collections have been formed and annotated with suitable class labels for only a very few of them. In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, “Animals with Attributes”, of over 30,000 animal images that match the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.
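The attribute-layer idea can be made concrete with a toy direct-attribute-prediction-style classifier. The class names, attribute signatures, and probabilities below are illustrative, not drawn from the “Animals with Attributes” data: an unseen class is chosen because its human-specified attribute signature best explains the attribute probabilities predicted for a test image.

```python
import numpy as np

# Attribute signatures for unseen classes (illustrative, not the Osherson table);
# columns: striped, four-legged, spotty, hooved.
signatures = {
    "zebra":     np.array([1, 1, 0, 1]),
    "dalmatian": np.array([0, 1, 1, 0]),
    "whale":     np.array([0, 0, 0, 0]),
}

def classify(attr_probs):
    """Pick the unseen class whose signature best explains the predicted
    per-attribute probabilities (attributes assumed independent)."""
    best, best_ll = None, -np.inf
    for cls, sig in signatures.items():
        # log-likelihood of the class signature given the attribute predictions
        ll = np.sum(np.log(np.where(sig == 1, attr_probs, 1 - attr_probs)))
        if ll > best_ll:
            best, best_ll = cls, ll
    return best

# Hypothetical output of pre-learned attribute classifiers on a test image:
p = np.array([0.9, 0.8, 0.1, 0.7])
print(classify(p))   # prints: zebra
```

Because the attribute classifiers are pre-learned on unrelated data, adding a new class only requires writing down its signature, with no new training phase.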
Human-nameable visual attributes offer many advantages when used as mid-level features for object recognition, but existing techniques to gather relevant attributes can be inefficient (costing substantial effort or expertise) and/or insufficient (descriptive properties need not be discriminative). We introduce an approach to define a vocabulary of attributes that is both human understandable and discriminative. The system takes object/scene-labeled images as input, and returns as output a set of attributes elicited from human annotators that distinguish the categories of interest. To ensure a compact vocabulary and efficient use of annotators' effort, we 1) show how to actively augment the vocabulary such that new attributes resolve inter-class confusions, and 2) propose a novel “nameability” manifold that prioritizes candidate attributes by their likelihood of being associated with a nameable property. We demonstrate the approach with multiple datasets, and show its clear advantages over baselines that lack a nameability model or rely on a list of expert-provided attributes.
Traditional active learning allows a (machine) learner to query the (human) teacher for labels on examples it finds confusing. The teacher then provides a label for only that instance. This is quite restrictive. In this paper, we propose a learning paradigm in which the learner communicates its belief (i.e. predicted label) about the actively chosen example to the teacher. The teacher then confirms or rejects the predicted label. More importantly, if rejected, the teacher communicates an explanation for why the learner's belief was wrong. This explanation allows the learner to propagate the feedback provided by the teacher to many unlabeled images. This allows a classifier to better learn from its mistakes, leading to accelerated discriminative learning of visual concepts even with few labeled images. In order for such communication to be feasible, it is crucial to have a language that both the human supervisor and the machine learner understand. Attributes provide precisely this channel. They are human-interpretable mid-level visual concepts shareable across categories, e.g. “furry”, “spacious”, etc. We advocate the use of attributes for a supervisor to provide feedback to a classifier and directly communicate their knowledge of the world. We employ a straightforward approach to incorporate this feedback in the classifier, and demonstrate its power on a variety of visual recognition scenarios such as image classification and annotation. This use of attributes to provide feedback to classifiers is very powerful and has not previously been explored in the community. It introduces a new mode of supervision and opens up several avenues for future research.
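A toy version of the propagation step (attribute names, scores, and the thresholding rule are all illustrative simplifications, not the paper's update): when the teacher explains a rejection in attribute terms, the learner can apply that explanation to the whole unlabeled pool at once.

```python
import numpy as np

rng = np.random.default_rng(2)
furriness = rng.uniform(size=20)        # predicted attribute strengths (synthetic)
candidate = np.ones(20, dtype=bool)     # unlabeled pool, all candidates for class C

# The learner's actively chosen example is rejected, and the teacher
# explains: "this image is too furry to belong to class C".
rejected = 5
candidate &= furriness < furriness[rejected]

# One explanation prunes every unlabeled image at least that furry,
# not just the single queried example.
print(int(candidate.sum()), "of 20 images remain candidates for class C")
```

This is why attribute-based explanations accelerate learning: a single teacher response constrains many unlabeled images, rather than labeling one instance.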
In recent years, there has been a great deal of progress in describing objects with attributes. Attributes have proven useful for object recognition, image search, face verification, image description, and zero-shot learning. Typically, attributes are either binary or relative: they describe either the presence or absence of a descriptive characteristic, or the relative magnitude of the characteristic when comparing two exemplars. However, prior work fails to model the actual way in which humans use these attributes in descriptive statements of images. Specifically, it does not address the important interactions between the binary and relative aspects of an attribute. In this work we propose a spoken attribute classifier which models a more natural way of using an attribute in a description. For each attribute we train a classifier which captures the specific way this attribute should be used. We show that as a result of using this model, we produce descriptions about images of people that are more natural and specific than past systems.
We introduce the use of describable visual attributes for face verification and image search. Describable visual attributes are labels that can be given to an image to describe its appearance. This paper focuses on images of faces and the attributes used to describe them, although the concepts also apply to other domains. Examples of face attributes include gender, age, jaw shape, nose size, etc. The advantages of an attribute-based representation for vision tasks are manifold: they can be composed to create descriptions at various levels of specificity; they are generalizable, as they can be learned once and then applied to recognize new objects or categories without any further training; and they are efficient, possibly requiring exponentially fewer attributes (and training data) than explicitly naming each category. We show how one can create and label large data sets of real-world images to train classifiers which measure the presence, absence, or degree to which an attribute is expressed in images. These classifiers can then automatically label new images. We demonstrate the current effectiveness, and explore the future potential, of using attributes for face verification and image search via human and computational experiments. Finally, we introduce two new face data sets, named FaceTracer and PubFig, with labeled attributes and identities, respectively.
This paper introduces a new idea in describing people using their first names, i.e., the names assigned at birth. We show that describing people in terms of similarity to a vector of possible first names is a powerful description of facial appearance that can be used for face naming and for building facial attribute classifiers. We build models for 100 common first names used in the United States and, for each pair of names, construct a pairwise first-name classifier. These classifiers are built using training images downloaded from the internet, with no additional user interaction. This gives our approach important advantages in building practical systems that do not require additional human intervention for labeling. We use the scores from each pairwise name classifier as a set of facial attributes. We show several surprising results. Our name attributes predict the correct first names of test faces at rates far greater than chance. The name attributes are applied to gender recognition and to age classification, outperforming state-of-the-art methods with all training images automatically gathered from the internet.
Recent studies have shown that facial cosmetics influence face recognition. This raises a question: can we develop a face recognition system that is invariant to facial cosmetics? To address this problem, we propose a method called dual attributes for face verification, which is robust to facial appearance changes caused by cosmetics or makeup. Attribute-based methods have been applied successfully to several computer vision problems, e.g., object recognition and face verification. However, no previous approach has specifically addressed the problem of facial cosmetics using attributes. Our key idea is that dual attributes can be learned separately from faces with and without cosmetics; the shared attributes can then be used to measure facial similarity irrespective of cosmetic changes. In essence, dual attributes match faces with and without cosmetics at a semantic level, rather than matching low-level features directly. To validate the idea, we assemble a database containing about 500 individuals with and without cosmetics. Experimental results show that our dual-attribute-based approach is quite robust for face verification. Moreover, the dual attributes are very useful for discovering the effect of makeup on facial identity at a semantic level.
Semantic attributes have been recognized as a more natural way to describe and annotate image content. It is widely accepted that image annotation using semantic attributes significantly improves on traditional binary or multiclass annotation due to their naturally continuous and relative properties. Though useful, existing approaches rely on abundant supervision and high-quality training data, which limits their applicability. Two standard ways to cope with limited guidance and low-quality training data are transfer learning and active learning; in the context of relative attributes, this entails learning multiple relative attributes simultaneously and actively querying a human for additional information. This paper addresses the two main limitations of existing work: 1) it actively adds humans to the learning loop so that minimal additional guidance is needed, and 2) it learns multiple relative attributes simultaneously, thereby leveraging the dependence among them. We formulate a joint active learning-to-rank framework with pairwise supervision to achieve these two aims, which also has other benefits such as the ability to be kernelized. The proposed framework optimizes a set of ranking functions (measuring the strength of the presence of attributes) simultaneously and dependently on each other. The proposed pairwise queries take the form “which of these two pictures is more natural?” and can be easily answered by humans. An extensive empirical study on real image datasets shows that, compared with several state-of-the-art methods, our proposed method achieves superior retrieval performance while requiring significantly less human input.
Semantic attributes represent knowledge that can be easily transferred to other domains where information and training samples are scarce. However, in the classical object recognition setting, where training data is abundant, attribute-based recognition usually performs poorly compared to methods that use image features directly. We introduce a generic framework that considerably boosts the performance of semantic attributes in traditional classification and knowledge transfer tasks, such as zero-shot learning. It combines the discriminative power of the visual features with the semantic meaning of the attributes by learning a common latent space that joins both. We also specifically account for attribute correlations in the source dataset in order to generalize more efficiently across domains. Our evaluation of the proposed approach on standard public datasets shows that it is not only simple and computationally efficient but also performs remarkably better than the common direct attribute model.
The immense number of face images now available has created a dire need for an efficient face recognition approach that can automatically retrieve a desired face image from a database. This paper therefore introduces the use of attributes for efficient face image search. The appearance of images can be described by labelling them with these attributes. This paper focuses mainly on face images and the attributes associated with them; age, race, hair color, smiling, and bushy eyebrows are some examples of face attributes. The benefit of an attribute-based face representation is that face images can be categorized at multiple levels of specificity based on the attribute descriptions: for example, one can describe a group of people as "brown female", or a specific person as "brown female long hair black eyes". We demonstrate the effectiveness of the proposed method by measuring attribute scores for face images in the PubFig image database. Further, this paper presents a novel and efficient face image representation based on Local Octal Pattern (LOP) texture features. The standard methods (LBP, LTP, LTrP) encode at most four distinct values describing the relationship between the referenced pixel and its corresponding neighbors. The proposed method encodes eight distinct values by calculating first-order derivatives of the pixels in the horizontal, vertical, and diagonal directions. The performance of the proposed method is compared with the standard methods in terms of average precision and average recall on the PubFig image database.
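For concreteness, the baseline LBP code that LOP refines can be computed as below (the 3x3 patch is illustrative): each of the eight neighbours contributes one bit depending on whether it is at least as bright as the centre pixel, and LTP, LTrP, and LOP extend this per-neighbour relationship to three, four, and eight distinct values respectively.

```python
import numpy as np

def lbp_code(patch):
    """8-neighbour local binary pattern code of the centre pixel of a 3x3 patch."""
    center = patch[1, 1]
    # neighbours taken clockwise starting at the top-left corner
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbours]
    return sum(b << i for i, b in enumerate(bits))   # pack 8 bits into 0..255

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))   # prints: 241
```

A texture descriptor is then the histogram of such codes over the whole face image; richer per-neighbour encodings like LOP make this histogram more discriminative at the cost of a larger code alphabet.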
This work proposes a novel sparse coding based approach for augmenting attributes in both object recognition and facial expression recognition applications. Attributes are a set of manually specified binary descriptions of visual objects. Though they play an important role in applications such as zero-shot learning, image description, and recognition, manually specified attributes capture the original image data only incompletely. In this work, we propose to augment the original manually specified semantic attributes with augmented attributes that are also sparse, based on minimizing the reconstruction error between the original image and the concatenation of the semantic and augmented attributes. We iteratively learn the dictionaries and recover the augmented attributes within the optimization. In our object recognition and facial expression recognition applications, the augmented attributes combined with the predicted semantic attributes improve the overall recognition rate. Our learned dictionaries also capture certain meanings expressed by the attributes.
We address the challenging problem of large-scale content-based face image retrieval: searching for images of a specific subject given one face image of that person. To this end, one natural approach is supervised binary code learning. While the learned codes may be discriminative, there is often a further expectation that some semantic message (e.g., visual attributes) can be read from the otherwise human-incomprehensible codes. For this purpose, we propose a novel binary code learning framework that jointly encodes identity discriminability and a number of facial attributes into a unified binary code. In this way, the learned binary codes can be applied not only to fine-grained face image retrieval but also to facial attribute prediction, which is the key innovation of this work: killing two birds with one stone. To evaluate the effectiveness of the proposed method, extensive experiments are conducted on a new purified large-scale web celebrity database, named CFW 60K, with abundant manual identity and attribute annotations, and the experimental results show the superiority of our method over the state of the art.
Facial Attributes for Active Authentication on Mobile Device
This paper presents a new approach to facial attribute classification using a multi-task learning model. Our model learns a shared feature representation that is well suited to classifying multiple attributes. To learn this shared feature representation we use a Restricted Boltzmann Machine-based model and enhance it with a factored multi-task component, yielding Multi-Task Restricted Boltzmann Machines. We operate on faces and facial landmark points and learn a joint feature representation for all attributes. We use an iterative learning approach consisting of a bottom-up/top-down pass to learn the shared representation of our multi-task model, and at inference we use a bottom-up pass to predict the different tasks. Our approach is not restricted to any particular type of attribute; in this paper, however, we focus on facial attributes. We evaluate our approach on three publicly available datasets and show a classification performance improvement over the state of the art.
Recent works have shown that facial attributes are useful in a number of applications, such as face recognition and retrieval. However, estimating attributes in images with large variations remains a big challenge, which this paper addresses. Unlike existing methods that assume the attributes are independent during estimation, our approach captures the interdependencies of local regions for each attribute, as well as the high-order correlations between different attributes, making it more robust to occlusions and misdetection of face regions. First, we model region interdependencies with a discriminative decision tree, where each node consists of a detector and a classifier trained on a local region: the detector locates the region, while the classifier determines the presence or absence of an attribute. Second, correlations among attributes and attribute predictors are modeled by organizing all of the decision trees into a large sum-product network (SPN), which is learned by the EM algorithm and yields the most probable explanation (MPE) of the facial attributes in terms of the regions' localization and classification. Experimental results on a large dataset of 22,400 images show the effectiveness of the proposed approach.
Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages over real images. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of real images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of real images that are semantically similar would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract images with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity. Finally, we study the relation between the saliency and memorability of objects and their semantic importance.