Automatic generation of a text according to a picture

Armeeva A.
Department of theoretical and computer linguistics
Moscow State University 
email: anna@bitsoft.ru


Abstract. This paper describes a model for a computer program that
generates a text from a picture. The picture belongs to one of three
types: interior, landscape or still life. The program takes into account
the human cognitive abilities involved in evaluating objects, which is a
necessary step in the verbal description of a picture. The program starts
by writing a list of objects represented by their names and coordinates.
Each object receives a prototypical weight and variable weights. The next
stage takes into account the location of the objects and plans the
description. The last stage is text production.

This paper describes a model for a computer program that generates a text
from a picture. The picture belongs to one of three types: interior,
landscape or still life. The program takes into account the human cognitive
abilities involved in evaluating objects, which is a necessary step in the
verbal description of a picture. In this program the objects are compared
by size and stability. The third property is the type of an object (slots
of a frame, animate objects, etc.). All these properties influence the
salience of an object. The salient object appears at the beginning of the
text or of a text fragment and is chosen as the Figure. There are
prototypical weights, held in human memory in a "knowledge base", and
variable weights, which the program derives from the properties of the
actual picture. These weights make it possible to model text generation on
the basis of visual perception. Thus two cognitive processes, verbalization
and visual perception, are brought into correlation. One of the essential
cognitive principles uniting these processes is the perceptual difference
between the Figure and the Ground. This difference is subsumed under the
notion of "salience" as it is used in cognitive linguistics. The basis of
salience is the inequality of the parts of a picture and of a text. In our
program salience is modeled by weights that form a hierarchy of the objects
to be described. This is salience at the level of knowledge organization.
At the level of text organization, the salience of an object is reflected
by the order of its appearance in the text and by its semantic role in a
locative sentence (Ground vs. Figure).
We do not analyze the stage of image recognition. It is assumed that all
the objects are already recognized and that every object has a name
denoting its category. The program starts by writing a list with the names
and coordinates of the objects. Each object has two x-coordinates, two
y-coordinates and one z-coordinate. The shape of an object is a flat
rectangle without thickness.
N1 [(x1 y1) (x2 y1) (x1 y2) (x2 y2) (z1)] [additional information],
N2 [(x1 y1) (x2 y1) (x1 y2) (x2 y2) (z1)] [additional information],
...,
Nk [(x1 y1) (x2 y1) (x1 y2) (x2 y2) (z1)] [additional information],
where Ni is the name of an object and [(x1 y1) (x2 y1) (x1 y2) (x2 y2)
(z1)] are the coordinates of the object: (x1 y1) is the lower left point,
(x2 y1) the lower right point, (x1 y2) the upper left point, (x2 y2) the
upper right point, and (z1) the z-coordinate.
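The list entry described above can be sketched as a small record type.
This is only an illustration: the class name PictureObject, its field names
and the sample coordinates are assumptions and are not part of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PictureObject:
    """One recognized object: a flat rectangle with a single depth value."""
    name: str        # category name, e.g. "bed"
    x1: float        # left edge
    x2: float        # right edge
    y1: float        # lower edge
    y2: float        # upper edge
    z: float         # z-coordinate (depth)
    info: str = ""   # the "additional information" field

    def corners(self):
        """Corner points in the order used in the list notation."""
        return [(self.x1, self.y1), (self.x2, self.y1),
                (self.x1, self.y2), (self.x2, self.y2)]

    @property
    def length(self):
        return self.x2 - self.x1

    @property
    def area(self):
        return (self.x2 - self.x1) * (self.y2 - self.y1)


# Hypothetical sample object; the numbers are invented for illustration.
bed = PictureObject("bed", x1=6, x2=10, y1=0, y2=3, z=1)
print(bed.corners(), bed.length, bed.area)
```

The length and area helpers are included because the variable weights
described later compare objects by length and by area.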
The names of objects are stored in a base and belong to one of three
groups: "Interior", "Landscape" or "Still life". If the list of objects
contains objects belonging to the group "Landscape", the description mode
"Landscape" is selected. If all the objects that do not belong to the
lowest level (the level of the objects with the smallest y-coordinates)
belong to the group "Still life", the mode "Still life" is selected. If all
the objects belong to the group "Interior", the mode "Interior" is
selected.
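The three selection rules can be sketched as follows. The sketch rests on
two assumptions that the text does not state: the rules are checked in the
order in which they are given, and the "Landscape" rule is triggered only by
names that occur in the group "Landscape" alone (otherwise "person", which
also belongs to "Interior", would always select "Landscape"). The function
name select_mode and the (name, y1) input format are hypothetical.

```python
# Group membership taken from the weight tables of this paper.
INTERIOR = {"wardrobe", "divan", "table", "support", "chair", "person",
            "door", "window", "statue", "statuette", "stove", "picture",
            "mirror", "chandelier"}
LANDSCAPE = {"tree", "building", "bush", "river", "field", "lake", "road",
             "person"}
STILL_LIFE = {"plate", "table", "wineglass", "vase", "grapes", "tray"}


def select_mode(objects):
    """objects: non-empty list of (name, y1) pairs; y1 is the lower y-coordinate."""
    names = [name for name, _ in objects]
    # Assumption: only names exclusive to "Landscape" trigger this mode.
    if any(name in LANDSCAPE - INTERIOR for name in names):
        return "Landscape"
    lowest = min(y for _, y in objects)
    upper = [name for name, y in objects if y > lowest]
    # Assumption: at least one object must lie above the lowest level.
    if upper and all(name in STILL_LIFE for name in upper):
        return "Still life"
    if all(name in INTERIOR for name in names):
        return "Interior"
    return None  # no rule applies


print(select_mode([("table", 0), ("plate", 2), ("vase", 2)]))
```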
Each name receives a prototypical weight that is built from some
prototypical properties of the object bearing this name. These properties
are typical size and typical degree of stability. The third property is the
type to which the object belongs.
An object receives the mark 0 if it is stable and -1 if it is unstable.
Objects are considered unstable if they are animate or can be moved easily
(statuette, cup, plate, etc.).
The typical size of an object depends on the group to which the object
belongs. If the object "person" belongs both to the group "Interior" and to
the group "Landscape", it receives different marks in these groups,
depending on the context: in an interior a person is more important than in
a landscape. Likewise, a plate belonging to the group "Still life" differs
from a plate in "Interior".
In the group "Interior" an object receives one of the following marks: 1
(it is smaller than a person), 2 (it is equal to a person in size) or 3 (it
is larger than a person). In the group "Landscape" an object receives: 1
(it is point-like and equal to a person in size), 2 (it is point-like and
larger than a person), 3 (it is point-like and much larger than a person),
4 (it is linear and much larger than a person) or 5 (it has a large area
and is much larger than a person). In the group "Still life" an object
receives: 2 (small), 3 (medium), 4 (large).
The third property is the type of an object. There are the following types:
slots of a frame, animate objects, parts of objects, supports (tables,
chairs, divans, etc.), covering objects (statues, tableware, etc.). The
type of an object influences the order in which objects appear in a text
and the choice of the object as the Figure or the Ground in such sentences
as "the bike is near the house".
An object receives the following marks according to its type: in the group
"Interior" all animate objects, covering objects and parts of objects
receive the mark -1. The rest of the objects receive 0. In the group
"Landscape" all animate objects receive the mark -1. The rest of the
objects receive 0. In the group "Still life" all covering objects receive
+1, the rest 0.
All the marks are summed up; this sum is the resulting prototypical weight.
Group "Interior"

              type   size   stability   sum
wardrobe        0      3        0        3
divan           0      2        0        2
table           0      2        0        2
support         0      2        0        2
chair           0      1        0        1
person         -1      2       -1        0
door            0      3        0        3
window          0      2        0        2
statue         -1      2        0        1
statuette      -1      1       -1       -1
stove           0      3        0        3
picture         0      1        0        1
mirror          0      1        0        1
chandelier      0      1        0        1

Group "Landscape"

              type   size   stability   sum
tree            0      2        0        2
building        0      3        0        3
bush            0      1        0        1
river           0      4        0        4
field           0      5        0        5
lake            0      5        0        5
road            0      4        0        4
person         -1      1       -1       -1

Group "Still life"

              type   size   stability   sum
plate          +1      2       -1        2
table          +1      4        0        5
wineglass      +1      2       -1        2
vase           +1      3       -1        3
grapes          0      2       -1        1
tray           +1      4        0        5
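The summation behind the tables can be sketched as a lookup followed by a
sum. Only a few rows of each table are copied here; the names PROTOTYPES and
prototypical_weight are hypothetical.

```python
# (type, size, stability) marks copied from the tables above.
PROTOTYPES = {
    "Interior": {
        "wardrobe": (0, 3, 0),
        "table": (0, 2, 0),
        "person": (-1, 2, -1),
        "statuette": (-1, 1, -1),
    },
    "Landscape": {
        "river": (0, 4, 0),
        "person": (-1, 1, -1),
    },
    "Still life": {
        "table": (1, 4, 0),
        "plate": (1, 2, -1),
    },
}


def prototypical_weight(group, name):
    """Sum of the type, size and stability marks of a name within a group."""
    return sum(PROTOTYPES[group][name])


# The same name can weigh differently in different groups.
print(prototypical_weight("Interior", "person"),
      prototypical_weight("Landscape", "person"))
```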

The prototypical weights reflect some properties of an object-prototype.
These properties influence the order of appearance of objects in a text,
the choice of this object as the Figure or the Ground. The name of an
object may be connected with some variable weights that are summed up with
the prototypical weight. The variable weights depend on the coordinates of
an object. 
The list of the variable weights:
isolatedness             0 (standard)                -1 (isolated object)
size                     0 (smaller object)          +1 (larger object)
singularity/plurality    0 (no plurality)            +1 (plurality)
remoteness               0 (foreground)              -1 (intermediate space)
                                                     -2 (background)
partial representation   0 (complete representation) -1 (partial
                                                     representation)
The variable weights are defined in the following way:
Isolatedness - the object is isolated if its length is less than the
distance between it and the adjacent objects.
Size - adjacent objects with equal prototypical weights are compared by
area. The larger object receives +1, the smaller 0. We compare objects on
one vertical line that do not belong to the lowest level, and objects on
one horizontal line that belong to the lowest level.
Singularity/plurality - two or more objects with the same name and the same
size that belong to one level and are close to each other form a set. This
means that they receive +1 for plurality. They are described together (e.g.
"I see books" instead of "I see a book"). If the sizes of objects with the
same name differ, the objects do not form a set. In this case one of them
becomes the Figure and the other the Ground, and the adjectives "great" and
"small" are used.
"One horizontal line" is defined in the following way: objects are on one
horizontal line if they cross the line drawn from the upper surface of the
highest object or lie below this line. The z-coordinate of all the objects
is equal.
"One vertical line" is defined analogously: objects are on one vertical
line if they lie within the vertical lines drawn from the lateral sides of
the object belonging to the lower level. The z-coordinate of all the
objects is equal.
Remoteness - applied only to the intermediate space and the background.
Partial representation - some objects can be represented only partially.
The information about partial representation is given together with the
coordinates of the object, in the field "additional information".
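A minimal sketch of how the variable weights could be combined with the
prototypical weight is given here. It assumes that "distance" means the
horizontal gap between two objects, that an object with no adjacent objects
counts as standard (0), and that the marks are simply added; all function
and parameter names are hypothetical.

```python
REMOTENESS = {"foreground": 0, "intermediate": -1, "background": -2}


def gap(a, b):
    """Horizontal gap between two (x1, x2) intervals; 0 if they overlap."""
    return max(b[0] - a[1], a[0] - b[1], 0)


def isolatedness(obj, neighbours):
    """-1 if the object's length is less than the gap to every adjacent object."""
    if not neighbours:
        return 0  # assumption: nothing to be isolated from
    length = obj[1] - obj[0]
    return -1 if all(length < gap(obj, n) for n in neighbours) else 0


def total_weight(prototypical, *, isolated=0, size=0, plurality=0,
                 remoteness="foreground", partial=False):
    """Prototypical weight plus the five variable weights."""
    return (prototypical + isolated + size + plurality
            + REMOTENESS[remoteness] + (-1 if partial else 0))


# An object of length 2 whose only neighbour is 3 units away is isolated.
iso = isolatedness((0, 2), [(5, 8)])
print(iso, total_weight(2, isolated=iso, plurality=1))
```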
The next stage takes into account the location of the objects and plans the
description. The object with the maximal weight is found; it is the Main
Object. No more than two objects are taken in each horizontal and vertical
direction. If there are more than two objects, the further objects are not
described with the Main Object. If the length of an object to the right or
to the left of the Main Object is less than the distance between the Main
Object and this object, the object is not described together with the Main
Object. The object with the maximal weight is then selected from the
remaining objects, and so on.
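The two restrictions on the neighbours of the Main Object can be sketched
for one direction (objects to the right). The sketch assumes that the
length test is applied to each object separately and that the nearest
objects are taken first; the function name and the (name, x1, x2) tuples
are hypothetical.

```python
def neighbours_to_the_right(main, others, limit=2):
    """Names of the objects described together with the Main Object.

    main and others are (name, x1, x2) tuples; every object in others is
    assumed to lie to the right of main.
    """
    selected = []
    for name, x1, x2 in sorted(others, key=lambda o: o[1]):
        distance = x1 - main[2]
        if (x2 - x1) < distance:
            continue  # shorter than its distance to the Main Object
        selected.append(name)
    return selected[:limit]  # no more than two objects per direction


# Invented coordinates for illustration.
main = ("bed", 0, 4)
others = [("chair", 4.5, 6.5), ("door", 6, 9), ("table", 7, 12),
          ("rack", 20, 21)]
print(neighbours_to_the_right(main, others))
```

Objects that are filtered out here stay in the pool from which the next
object with the maximal weight is selected.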
If objects with the same name that do not belong to one set are described
together, the adjective "another" is inserted. If there are two pairs of
objects with the same name, the words "one more" are inserted before the
second object of the second pair.
Trajectory of description. Our description strategy is a point-by-point
strategy that is anchored at certain objects, and this anchoring depends on
the properties of these objects. First the group of the Main Object is
described, then the next object with the maximal weight is selected, its
group is described, and so on. If there are several objects with the
maximal weight, the description moves from left to right.
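The order of the anchor objects can be sketched as a sort by descending
weight with ties broken from left to right. The sketch ignores the
grouping of neighbours around each anchor; the function name and the
(name, weight, x1) tuples are hypothetical, and the x-coordinates are
invented.

```python
def anchor_order(objects):
    """objects: list of (name, weight, x1); heaviest first, ties left to right."""
    return [name for name, _, _ in
            sorted(objects, key=lambda o: (-o[1], o[2]))]


print(anchor_order([("chair", 1, 3), ("wardrobe", 3, 5),
                    ("stove", 3, 0), ("table", 2, 2)]))
```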
Description in groups. The general scheme is as follows: first the object
with the maximal weight is described, then the upper objects, then the
front objects, then the back objects, then the left objects and the right
objects. The latter four groups can be extended with the description of
upper objects. Each group can contain no more than two objects. If some
front or back objects are at the same distance from the Main Object, we
first describe the object with the greater weight. If there is more than
one such object, we first describe the left one. If many objects belonging
to the still life are present in an interior, they are described by
enumeration, without figure-ground relations (e.g. "there are jugs, a
bottle, a cup on the table").
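The scheme for one group can be sketched as follows. The text only fixes
the tie-breaking for objects at the same distance; ordering the objects of
a direction by distance first is an assumption of this sketch, as are the
names GROUP_ORDER and plan_group and the (name, distance, weight, x1)
tuples.

```python
GROUP_ORDER = ["upper", "front", "back", "left", "right"]


def plan_group(main, neighbours):
    """Order in which the Main Object and its neighbours are described.

    neighbours maps a direction to a list of (name, distance, weight, x1).
    """
    plan = [main]
    for direction in GROUP_ORDER:
        ranked = sorted(neighbours.get(direction, []),
                        key=lambda o: (o[1], -o[2], o[3]))
        # Nearest first; at equal distance the heavier, then the left one.
        plan.extend(name for name, *_ in ranked[:2])  # at most two objects
    return plan


# Invented distances, weights and positions for illustration.
print(plan_group("bed", {
    "front": [("door", 2, 3, 0)],
    "back": [("rack", 1, 1, 5)],
    "left": [("chair", 1, 1, 0)],
    "right": [("picture", 1, 1, 9), ("mirror", 1, 2, 8), ("clock", 3, 1, 7)],
}))
```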
The last stage is text production. There is a base with the morphological
properties of the words used. We build template constructions with control
of number agreement.
Text production starts by choosing a word that characterizes the picture.
The choice depends on the strategy: one of the words "landscape",
"interior" or "still life" is chosen.
After that, text production is based on the patterns belonging to the
selected strategy.
An example: "Interior".
The foreground
Each of the following groups (except 0) can be omitted.
0) The Main Object A1 (it is described by the pattern K0) - 1) the upper
objects, 2) the front objects, 3) the back objects, 4) the left objects, 5)
the right objects. The structure of the groups 1) - 5) can vary. The
possible variants are:
Case 1
If there is only one object close to A1 (or a group of objects forming a
set), each group 1) - 5) is described by the patterns:
Group 1) - K1; Group 2) - K2; Group 3) - K3; Group 4) - K4; Group 5) - K5.
Case 2
If the groups 1), 2), 3) contain two objects close to A1 that do not form a
set, they are described by the following patterns...
Case 3
If the groups 1) - 5) contain two objects (not forming a set) at different
distances from A1, they are described by the following patterns...
The groups 2) - 5) can be supplemented with upper objects. In this case
they are described by the following patterns...
Below are some patterns.
Conventional signs:
A - object
N - name of the object described in this sequence
I - name of the Main Object A1
A(N) - object described in this sequence

The first line gives the number of the pattern and some comments. The
second line contains the template construction, which reflects the order of
the parts of the produced sequence. Square brackets enclose an optional
part (one of the following words: "great", "small", "another", "one more").
Round brackets enclose the morphological properties of a word, namely its
number; the choice of the number depends on the situation (we can describe
a set or a single object). The next lines contain the realization of the
template construction. A slash marks variation that depends on the
situation in the picture. Some patterns are compound: they contain several
mutually incompatible variants (e.g. pattern K1).

K0 (the Main Object A1)
PP V [Adj] Art N (pl./sg.)
PP: in the center / on the left / on the right
V: there is (if N - sg.) / there are (if N - pl.)
Art: a (if N - sg.)

K1 (A1 and A(N) touch each other)
Art1 [Adj] N (pl./sg.) V Prep Art2 [Adj] I (pl./sg.)
Art1: a (if N - sg.)
V: is (if N - sg.) / are (if N - pl.)
Prep: on
Art2: the
or
K1 (A1 - a set, A(N) touches one of the objects of I)
Art1 [Adj] N (pl./sg.) V Prep Pron Art2 [Adj] I (pl.)
Art1: a (if N - sg.)
V: is (if N - sg.) / are (if N - pl.)
Prep: on
Pron: one of
Art2: the
or
K1 (A1 and A(N) do not touch each other)
Art1 [Adj] N (pl./sg.) V Prep Art2 [Adj] I (pl./sg.)
Art1: a (if N - sg.)
V: hangs (if N - sg.) / hang (if N - pl.)
Prep: over
Art2: the

K2 (A(N) is before A1)
V Art1 [Adj] N (pl./sg.) Prep Art2 [Adj] I (pl./sg.)
V: there is (if N - sg.) / there are (if N - pl.)
Art1: a (if N - sg.)
Prep: before
Art2: the

K3 (A(N) is behind A1)
V Art1 [Adj] N (pl./sg.) Prep Art2 [Adj] I (pl./sg.)
V: there is (if N - sg.) / there are (if N - pl.)
Art1: a (if N - sg.)
Prep: behind
Art2: the

K4 (A(N) is at the left of A1)
Art1 [Adj] N (pl./sg.) V Prep Art2 [Adj] I (Gen; pl./sg.)
Art1: a (if N - sg.)
V: is (if N - sg.) / are (if N - pl.)
Prep: at the left of
Art2: the

K5 (A(N) is at the right of A1)
Art1 [Adj] N (pl./sg.) V Prep Art2 [Adj] I (Gen; pl./sg.)
Art1: a (if N - sg.)
V: is (if N - sg.) / are (if N - pl.)
Prep: at the right of
Art2: the
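A minimal sketch of how such patterns might be realized with number
agreement is given here. It covers K0, K2/K3 and the "is/are" variants of
K1, K4 and K5, and omits the optional adjective on I as well as the
"hangs/hang over" variant of K1. The naive plural rule (adding "s") stands
in for the morphological base, and the choice between "a" and "an" is an
addition of this sketch; all function names are hypothetical.

```python
def pluralize(noun):
    return noun + "s"  # naive stand-in for the morphological base


def noun_phrase(noun, plural=False, adj=None, definite=False):
    """[Art] [Adj] N with number agreement."""
    head = pluralize(noun) if plural else noun
    words = [adj, head] if adj else [head]
    if definite:
        words.insert(0, "the")
    elif not plural:
        # Indefinite article only in the singular.
        words.insert(0, "an" if words[0][0] in "aeiou" else "a")
    return " ".join(words)


def sentence(*parts):
    text = " ".join(parts)
    return text[0].upper() + text[1:] + "."


def k0(place, noun, plural=False, adj=None):
    """K0: PP V [Adj] Art N."""
    verb = "there are" if plural else "there is"
    return sentence(place, verb, noun_phrase(noun, plural, adj))


def k2_k3(prep, noun, main, plural=False, adj=None, main_plural=False):
    """K2/K3: V Art1 [Adj] N Prep Art2 I."""
    verb = "there are" if plural else "there is"
    return sentence(verb, noun_phrase(noun, plural, adj), prep,
                    noun_phrase(main, main_plural, definite=True))


def k1_k4_k5(prep, noun, main, plural=False, adj=None, main_plural=False):
    """K1 (touching), K4, K5: Art1 [Adj] N V Prep Art2 I."""
    verb = "are" if plural else "is"
    return sentence(noun_phrase(noun, plural, adj), verb, prep,
                    noun_phrase(main, main_plural, definite=True))


print(k0("on the right", "bed"))
print(k2_k3("before", "door", "bed"))
print(k1_k4_k5("at the left of", "chair", "bed"))
```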

Here is an example of a generated text. The picture is "The bedroom" by
Van Gogh, 1889.
I see an interior. On the right there is a bed. There is a door before the
bed. There is a rack behind the bed. Shirts are on the rack. A chair is at
the left of the bed. Pictures are at the right of the bed. There are
another pictures over the pictures. 
On the left there is a door. There is a chair before the door. There is a
towel behind the door. 
On the left there is a table. Jugs, bottles, a glass, a loaf are on the
table. There is a window over the table. 
On the left there is a mirror.
On the right there is a picture.

The model is rather simple: it does not take into account all the factors
that form the salience of an object (e.g. the use of an object in a picture
influences its description). The model generates texts for static pictures.
In the near future we will take into account more factors influencing the
salience of an object.