Abstract

Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation such as visual attributes or semantic word vectors. Such a semantic representation is shared between an annotated auxiliary dataset and a target dataset with no annotation. A projection from a low-level feature space to the semantic space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify an inherent limitation with this approach. That is, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. It is ‘transductive’ in that unlabelled target data points are explored for projection adaptation, and ‘multi-view’ in that both low-level feature (view) and multiple semantic representations (views) are embedded to rectify the projection shift. We demonstrate through extensive experiments that our framework (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) achieves state-of-the-art recognition results on image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.

Idea Illustration

An inherent problem exists in this zero-shot learning approach: since the two datasets have different and potentially unrelated classes, the underlying semantic prototypes of the classes also differ, as do the 'ideal' projection functions between the low-level feature space and the semantic spaces. Using the projection functions learned from the auxiliary dataset/domain without any adaptation on the target dataset/domain therefore introduces an unknown shift/bias. We call this the projection domain shift problem.

The problem is illustrated in Fig. 1, which shows two object classes from the Animals with Attributes (AwA) dataset [10]: Zebra is one of the 40 auxiliary classes, whilst Pig is one of the 10 target classes. Both share the same 'hasTail' attribute, but the visual appearance of their tails differs greatly (Fig. 1(a)). Similarly, many other attributes of Pig are visually very different from those of the 40 auxiliary classes. Fig. 1(b) plots (in 2D using t-SNE [11]) an 85D attribute space representation of the image feature projections and class prototypes (85D binary attribute vectors) to illustrate the existence of the projection domain shift problem: there is a great discrepancy between the position of the Pig prototype in the semantic attribute space and the projections of its class member instances, while no such discrepancy exists for the auxiliary Zebra class. This discrepancy arises because the projection functions learned from the 40 auxiliary classes are applied directly to project the Pig instances: what 'hasTail' (as well as each of the other 84 attributes) visually means is now different. Such a discrepancy inherently degrades the effectiveness of zero-shot recognition of the Pig class. The projection domain shift problem is uniquely challenging in that there is no labelled information in the target domain to guide domain adaptation in mitigating it. To our knowledge, this problem has neither been identified nor addressed in the literature.
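To make the shift concrete, below is a minimal synthetic sketch (not the paper's code; it assumes numpy and scikit-learn, and random data stands in for real AwA features): an attribute projection is learned on an 'auxiliary' domain and then applied, unadapted, to a 'target' domain whose attributes are realised by a different feature-to-attribute mapping, and the projection error jumps.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d_feat, d_attr, n = 100, 85, 400

# Two domains share the attribute vocabulary, but disjoint, unrelated
# classes realise it through different feature->attribute mappings.
W_aux = rng.normal(size=(d_attr, d_feat)) / np.sqrt(d_feat)
W_tgt = rng.normal(size=(d_attr, d_feat)) / np.sqrt(d_feat)

X_aux = rng.normal(size=(n, d_feat))
A_aux = X_aux @ W_aux.T                     # auxiliary attribute annotations

# The projection is learned on the auxiliary domain only.
proj = Ridge(alpha=1.0).fit(X_aux, A_aux)

# Held-out auxiliary instances project close to their true attributes ...
X_aux_test = rng.normal(size=(n, d_feat))
err_aux = np.linalg.norm(proj.predict(X_aux_test) - X_aux_test @ W_aux.T, axis=1).mean()

# ... but target instances, whose attributes follow W_tgt, do not.
X_tgt = rng.normal(size=(n, d_feat))
err_tgt = np.linalg.norm(proj.predict(X_tgt) - X_tgt @ W_tgt.T, axis=1).mean()

print(f"aux-domain projection error:    {err_aux:.2f}")
print(f"target-domain projection error: {err_tgt:.2f}  # the projection shift")
```

This gap between the two errors is what the transductive multi-view embedding framework rectifies, by exploiting the unlabelled target instances to adapt the projections.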

Download

100-dimension semantic word vectors for the 50 AwA classes: Download
1000-dimension semantic word vectors: Download

Code and semantic word dictionary

https://github.com/yanweifu/embedding_zero-shot-learning

We use all Wikipedia articles to train the Google word2vec model
and generate the semantic word vector dictionary, which we will release to the community.
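For reference, a dictionary of this kind can be reproduced along the following lines. This is a hedged sketch that uses gensim's Word2Vec as a stand-in for Google's word2vec tool, and 'wiki_sentences.txt' is a hypothetical preprocessed corpus file with one sentence per line.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the preprocessed Wikipedia corpus, one sentence per line
# ('wiki_sentences.txt' is a hypothetical file name).
sentences = LineSentence("wiki_sentences.txt")

model = Word2Vec(
    sentences,
    vector_size=100,   # matches the released 100-D vectors; use 1000 for the larger set
    window=5,
    min_count=5,
    workers=4,
)

# Look up the semantic word vector for a class name, e.g. an AwA class.
vec = model.wv["zebra"]
print(vec.shape)   # (100,)
```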

Paper:

[1] Fu, Y.; Hospedales, T.M.; Xiang, T.; Fu, Z.; Gong, S.: Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation. ECCV 2014. Paper

bib: @INPROCEEDINGS{embedding2014ECCV,
author = { Yanwei Fu and Timothy M. Hospedales and Tao Xiang and Zhenyong Fu and Shaogang Gong},
title = {Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation},
booktitle = {ECCV},
year = {2014}
}

[2] Fu, Y.; Hospedales, T.M.; Xiang, T.; Gong, S.: Transductive Multi-view Zero-Shot Learning. Accepted to IEEE TPAMI (2015)

bib: @ARTICLE{embedding2015TPAMI,
author = {Yanwei Fu and Timothy M. Hospedales and Tao Xiang and Shaogang Gong},
title = {Transductive Multi-view Zero-Shot Learning},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2015}
}

t-SNE visualisation of our journal submission
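A plot of this kind (attribute-space projections together with class prototypes, as in Fig. 1(b)) can be produced along the following lines. This is a minimal sketch assuming scikit-learn and matplotlib, with synthetic arrays standing in for the real projections and prototypes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
projections = rng.normal(size=(300, 85))               # projected image features
prototypes = (rng.uniform(size=(2, 85)) > 0.5) * 1.0   # e.g. Zebra and Pig, 85-D binary

# Embed projections and prototypes together so they share one 2-D map.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([projections, prototypes])
)

plt.scatter(embedded[:-2, 0], embedded[:-2, 1], s=8, label="instance projections")
plt.scatter(embedded[-2:, 0], embedded[-2:, 1], marker="*", s=200, label="class prototypes")
plt.legend()
plt.savefig("tsne_attribute_space.png")
```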