We propose a high-level
image representation, called the Object Bank, where an image is
represented as a scale-invariant response map of a large number of
pre-trained generic object detectors, blind to the testing dataset or
visual task. As we tackle higher-level visual recognition problems, we
show that a more semantic-level image representation such as the Object
Bank can capture important information in the image without invoking
highly elaborate statistical models to build up the features and
concepts from pixels or low-level features.
Robust low-level image features have proven to be effective
representations for a variety of tasks such as object recognition and
scene classification, but pixels, or even local image patches, carry
little semantic meaning. For high-level visual tasks, such low-level
image representations may not be enough. We propose a
high-level image representation, called the Object Bank, where an
image is represented as a scale-invariant response map of a large
number of pre-trained generic object detectors, blind to the testing
dataset or visual task. As we tackle higher-level visual
recognition problems, we show that the Object Bank representation is
powerful on scene classification tasks. It significantly outperforms
low-level image representations and state-of-the-art
approaches on several benchmark datasets.
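
To make the idea of a response map over many detectors concrete, the following is a minimal sketch, not the implementation used in this work. It assumes each pre-trained detector can be stood in for by a small template that is correlated with the image, that multiple scales are obtained by naive strided downsampling, and that each response map is summarized by max pooling over coarse grids. The names `detector_response`, `downsample`, `grid_max_pool`, and `object_bank_features`, along with the pooling scheme and all parameter values, are hypothetical choices made for illustration only.

```python
import numpy as np


def detector_response(image, template):
    """Hypothetical stand-in for one pre-trained object detector.

    Returns the 'valid' cross-correlation of a 2-D image with a 2-D template.
    """
    h, w = template.shape
    H, W = image.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * template)
    return out


def downsample(image, factor):
    """Naive downsampling by striding (assumption: no anti-aliasing)."""
    return image[::factor, ::factor]


def grid_max_pool(response, grid):
    """Max response inside each cell of a grid x grid partition of the map."""
    rows = np.array_split(np.arange(response.shape[0]), grid)
    cols = np.array_split(np.arange(response.shape[1]), grid)
    return np.array([response[np.ix_(r, c)].max() for r in rows for c in cols])


def object_bank_features(image, templates, factors=(1, 2), grids=(1, 2)):
    """Concatenate pooled detector responses over detectors, scales, and grids."""
    feats = []
    for template in templates:          # one block of features per detector
        for f in factors:               # run the detector at several scales
            response = detector_response(downsample(image, f), template)
            for g in grids:             # summarize each response map coarsely
                feats.append(grid_max_pool(response, g))
    return np.concatenate(feats)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((16, 16))                        # toy grayscale image
    templates = [rng.random((3, 3)) for _ in range(3)]  # toy "detectors"
    print(object_bank_features(image, templates).shape)
```

In this sketch the feature length is the number of detectors times the number of scales times the total number of grid cells, so the vector grows linearly with the size of the detector bank. Such a vector could then be passed to an off-the-shelf classifier for a task such as scene classification.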