机器人与人工智能爱好者论坛

 找回密码
 立即注册
查看: 5213|回复: 0
打印 上一主题 下一主题

Explain Images with Multimodal Recurrent Neural Networks

[复制链接]

80

主题

103

帖子

749

积分

高级会员

Rank: 4

积分
749
跳转到指定楼层
楼主
发表于 2016-1-5 22:12:09 | 只看该作者 回帖奖励 |倒序浏览 |阅读模式
Explain Images with Multimodal Recurrent Neural Networks



Junhua Mao1;2 Wei Xu1 Yi Yang1 Jiang Wang1 Alan L. Yuille2
1Baidu Research 2University of California, Los Angeles
mjhustc@ucla.edu, fwei.xu,yangyi05,wangjiang03g@baidu.com, yuille@stat.ucla.edu
Abstract
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model
for generating novel sentence descriptions to explain the content of images. It
directly models the probability distribution of generating a word given previous
words and the image. Image descriptions are generated by sampling from this
distribution. The model consists of two sub-networks: a deep recurrent neural
network for sentences and a deep convolutional network for images. These two
sub-networks interact with each other in a multimodal layer to form the whole
m-RNN model. The effectiveness of our model is validated on three benchmark
datasets (IAPR TC-12 [8], Flickr 8K [28], and Flickr 30K [13]). Our model outperforms
the state-of-the-art generative method. In addition, the m-RNN model
can be applied to retrieval tasks for retrieving images or sentences, and achieves
significant performance improvement over the state-of-the-art methods which directly
optimize the ranking objective function for retrieval.
1 Introduction
Obtaining sentence level descriptions for images is becoming an important task and has many applications,
such as early childhood education, image retrieval, and navigation for the blind. Thanks
to the rapid development of computer vision and natural language processing technologies, recent
works have made significant progress for this task (see a brief review in Section 2). Many of these
works treat it as a retrieval task. They extract features for both sentences and images, and map them
to the same semantic embedding space. These methods address the tasks of retrieving the sentences
given the query image or retrieving the images given the query sentences. But they can only label
the query image with the sentence annotations of the images already existing in the datasets, thus
lack the ability to describe new images that contain previously unseen combinations of objects and
scenes.
In this work, we propose a multimodal Recurrent Neural Networks (m-RNN) model to address both
the task of generating novel sentences descriptions for images, and the task of image and sentence
retrieval. The whole m-RNN architecture contains a language model part, an image part and a
multimodal part. The language model part learns the dense feature embedding for each word in the
dictionary and stores the semantic temporal context in recurrent layers. The image part contains a
deep Convulutional Neural Network (CNN) [17] which extracts image features. The multimodal
part connects the language model and the deep CNN together by a one-layer representation. Our
m-RNN model is learned using a perplexity based cost function (see details in Section 4). The
errors are backpropagated to the three parts of the m-RNN model to update the model parameters
simultaneously. To the best of our knowledge, this is the first work that incorporates the Recurrent
Neural Network in a deep multimodal architecture.
In the experiments, we validate our model on three benchmark datasets: IAPR TC-12 [8], Flickr 8K
[28], and Flickr 30K [13]. we show that our method significantly outperforms the state-of-the-art
methods in both the task of generating sentences and the task of image and sentence retrieval when
using the same image feature extraction networks. Our model is extendable and has the potential to
be further improved by incorporating more powerful deep networks for the image and the sentence.



Explain Images with Multimodal Recurrent Neural Networks.pdf (575.56 KB, 下载次数: 2)
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

站长推荐上一条 /1 下一条

QQ|Archiver|手机版|小黑屋|陕ICP备15012670号-1    

GMT+8, 2024-4-30 10:28 , Processed in 0.081725 second(s), 27 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.

快速回复 返回顶部 返回列表