Database model for storing linguistic information
        
        A.Pominov, pom@rsc.msk.su 
        Researcher for Multitran Automatic Dictionaries


Abstract
An approach to the creation of a specialized database is considered, based
on the model implemented in the Multitran Automatic Dictionary. The concept
centers on an optimized, highly ramified B-tree with variable record size.
The model enables quick data access, appending and modification, and was
developed specifically for the storage of linguistic information in
computer dictionaries.

The storage of information in linguistic systems (machine translation,
computer dictionaries, spell-checkers and hypertext systems) places a
number of special requirements on the database model compared to other
applications.

Processing of text data requires the following features of the information
system: the capability to process large numbers of records (from hundreds
of thousands to several million entries), efficient handling of variable
record sizes (in Russian, for example, words range from 1 to more than 30
letters), high speed of operation on large amounts of data, and the
possibility to modify or delete existing records and to append new ones.

The use of standard database management packages such as Fox, Clipper,
Paradox and others seems inappropriate for the following reasons:
1. 	Fixed record size. In the relational models of the available packages,
fixed-size records significantly increase the database size, which
adversely affects the speed of basic operations. Even splitting oversize
entries into several records would mean some loss of space as well as
additional processing.
2. 	Separate storage of the data and the key-field indexes in different
files. This creates data redundancy and an additional loss of file space
and search time.
3. 	Low processing speed on large amounts of data. Standard DBMS systems
generally operate with acceptable efficiency for databases containing
several hundred thousand entries; beyond that, index access time grows in
proportion to the database size.

These limitations motivate the creation of a specialized database that
meets the requirements of text processing. The proposed model is based on
the widely used B-tree structure, optimized to overcome the problems
indicated above.

The basic principle underlying the approach is the observation that the
speed of data processing is inversely proportional to the size of the
data. This relationship can be exploited to improve one parameter of the
system, such as speed, by minimizing the physical database size. By
excluding data redundancy and eliminating wasted record space on the one
hand, and by using quick data access methods on the other, acceptable
results can be obtained for building practical linguistic systems.

The optimizations implemented in the structure of the Multitran computer
dictionary can be considered in two stages.

1. 	Methods of reducing database size
First of all, each type of data in the system is represented as a single
file whose records contain two fields: the key field and the information
field. Thus each database contains both the useful information and the
means of indexed access at the same time. The key field is used for
indexed data search and contains the search value corresponding to the
given record. The information field contains data associated with the key
field and is retrieved when the key search succeeds. There can be only one
index field per database, but the description below shows that this is in
fact sufficient for each type of data used in the automatic dictionary.

The data types used in the automatic dictionary are the stem dictionary,
the translation index and the translation dictionary. The stem dictionary
stores stems (or pseudostems) of words in the key field and morphology
data in the information field; this data is used for morphological
analysis and synthesis. The translation index contains word number(s) in
the key field and the corresponding term number(s) in the information
field. The translation dictionary stores the term number in the key field
and the translation entry proper in the information field. The use of
floating record sizes in each of these cases makes it possible to
significantly reduce the database size.
The second design decision concerns the maximum record size. For the given
task of storing linguistic data, the upper limit is set to 255 bytes for
the key field and 255 bytes for the information field. The size of each
record therefore varies from 0 to 510 bytes.

2. 	Methods of optimizing data access in the B-tree
The B-tree concept implies several levels of pages, with the search
starting at the single top-level page and continuing downward until the
lowest level is reached. Data are stored in such a way as to allow sorted
access. Within the framework of the considered model, the actual data are
stored only at the lowest level of pages, while the upper-level pages
contain the reference data needed to perform ordered searches. The upper
levels contain control pages and the lowest level stores information
pages. The control page at which every search in the B-tree starts is
called the root page. Each database consists of several levels of control
pages and a single level of information pages. Each page contains records
that store the sizes of the key and information fields in their first two
bytes, followed by the values of these fields. The key fields are kept in
increasing order throughout the page and the whole index, while the
information fields can contain any data. Thus the size of each field in a
record varies from 0 to 255 bytes, and the page is filled without loss of
space until it overflows during appending.
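To make the record layout concrete, here is a minimal sketch in Python
(the function names are illustrative, not taken from the Multitran
implementation):

```python
def pack_record(key: bytes, info: bytes) -> bytes:
    """Pack a record: one length byte per field, then the field values."""
    assert len(key) <= 255 and len(info) <= 255
    return bytes([len(key), len(info)]) + key + info

def unpack_record(page: bytes, offset: int):
    """Unpack the record starting at `offset`.

    Returns (key, info, next_offset), where next_offset points at the
    record that follows on the page.
    """
    klen, ilen = page[offset], page[offset + 1]
    start = offset + 2
    key = page[start:start + klen]
    info = page[start + klen:start + klen + ilen]
    return key, info, start + klen + ilen
```

For example, packing the key 'medium' with a 2-byte information field
yields a 10-byte record: two length bytes followed by the field values.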
Each key field of a control page contains the minimal differing substring
between the last record of an information page and the first record of the
information page that directly follows it in key order. Taking the English
stem dictionary as an example: if the last record of an information page
contains 'medium' in its key field and the next information page contains
'meet' as the key field of its first record, then the control-level record
corresponding to the first information page contains 'mee' in its key
field and the offset (or number) of that first page in its information
field.





                          Control level
                      ...  [ mee -> 1 ]  ...
                                |
                                v
      Information page 1                Information page 2
      [  ...   medium  ]                [  meet   ...  ]
            last record                   first record
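The formation of such a delimiter can be sketched as follows (an
illustrative Python function, not from the Multitran sources): it returns
the shortest prefix of the following page's first key that still sorts
above the preceding page's last key.

```python
def shortest_delimiter(last_key: bytes, next_key: bytes) -> bytes:
    """Shortest prefix of next_key that is strictly greater than last_key.

    Assumes last_key < next_key in byte order, as guaranteed by the
    sorted page layout.
    """
    for n in range(1, len(next_key) + 1):
        prefix = next_key[:n]
        if prefix > last_key:
            return prefix
    return next_key  # next_key itself is the shortest option
```

With the example above, shortest_delimiter(b'medium', b'meet') yields
b'mee': greater than 'medium' but not greater than 'meet', so it separates
the two pages correctly.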




Data search

The search starts from the beginning of the root page and stops at the
first record whose key field is equal to or greater than the search
string. The procedure continues at the page specified in the information
field of the found record until an information page is reached. The result
of the search in the information page is used as the overall result of the
seek operation, the search string being equal to, smaller than or greater
than the found record. If several coinciding records exist, the search
procedure stops at the first of them.
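Within a single page the search can be sketched as follows (an
illustrative helper, assuming the record layout described earlier: two
length bytes followed by the key and information values). It scans from
the beginning of the page and stops at the first record whose key is equal
to or greater than the search string.

```python
def search_page(page: bytes, used: int, search_key: bytes):
    """Stop at the first record whose key is >= search_key.

    `used` is the number of bytes occupied on the page. Returns
    (offset, key, info) of the found record, or None if every key on
    the page is smaller than search_key.
    """
    offset = 0
    while offset < used:
        klen, ilen = page[offset], page[offset + 1]
        key = page[offset + 2:offset + 2 + klen]
        if key >= search_key:
            info = page[offset + 2 + klen:offset + 2 + klen + ilen]
            return offset, key, info
        offset += 2 + klen + ilen
    return None
```

On a control page the returned information field would hold the number of
the lower-level page at which the search continues.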

Appending records to the index

Appending a new record starts from the current database position set by
the search operation performed for the new record. If several coinciding
records exist, the append takes place after the last of them. If there is
enough free space on the page, the new record is inserted in the order of
key fields. Because page overflow is relatively rare, the basic operations
needed for appending are the search for the record position, modification
of the page contents in memory and writing the page back to disk.

If there is not enough space on the page for the new record, the overflow
procedure is performed. The standard practice with B-trees is to split the
page into two portions of the same size to allow balanced growth of the
tree. On the one hand, with floating record sizes it is not always
possible to split a page into two equal portions; on the other hand, the
overflow stage allows optimizations that improve overall database
performance. The overflow procedure that splits the page produces a
delimiter, which is appended to the upper level. The size of this
delimiter directly affects the growth behavior of the B-tree: the shorter
the delimiter, the slower the tree grows as new entries are appended. For
this reason it is effective to form the shortest possible delimiter. The
simplest way is to form the delimiter as the shortest differing portion of
two successive records. Moreover, to find the split position it is
possible to scan a number of records in the vicinity of the page center
and choose the pair with the shortest differing portion. The delimiter
formed in this way is appended to the corresponding upper-level control
page.
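The split-point selection can be sketched as follows (an illustration
only: for clarity the page is represented as a Python list of (key, info)
pairs rather than raw bytes, and the function names are hypothetical).

```python
def shortest_separator(last_key: bytes, next_key: bytes) -> bytes:
    # Shortest prefix of next_key that still sorts above last_key.
    for n in range(1, len(next_key) + 1):
        if next_key[:n] > last_key:
            return next_key[:n]
    return next_key

def choose_split(records, window: int = 4):
    """Scan record pairs near the page center and pick the split
    position whose separator (delimiter) is shortest."""
    center = len(records) // 2
    lo = max(1, center - window)
    hi = min(len(records), center + window + 1)
    best_pos, best_delim = None, None
    for pos in range(lo, hi):
        delim = shortest_separator(records[pos - 1][0], records[pos][0])
        if best_delim is None or len(delim) < len(best_delim):
            best_pos, best_delim = pos, delim
    return best_pos, best_delim
```

For keys market, mask, medium, meet, member, mental the exact center would
give the 3-byte delimiter 'mee', but scanning nearby pairs finds the split
between 'mask' and 'medium' with the shorter delimiter 'me'.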

The first portion of the overflowing page remains in its place in the
file, while the second is written to the end of the file and forms a new
page. The delimiter is appended to the control page so as to maintain the
sorted order of its key fields. The append to the upper page can itself
cause an overflow, which is handled in the same manner; the only
difference is that the records of a control page are shorter and therefore
form shorter delimiters. If an append causes the root page to overflow, a
new root page is formed, increasing the height of the tree. This also
increases the search time of subsequent seek operations, because speed is
inversely proportional to tree height.

Once the overflown page has been split, the new record that initially
caused the overflow can be appended. As a result of the split, two pages
are formed, each approximately half empty, so irrespective of the split
point the new record fits into one of them.

Skipping to next/previous record
Skipping to the next record means adding the current record size to the
current record offset. If this operation brings the record pointer to the
end of the page, we have to read the corresponding control page, move to
its next record and read the corresponding information page. In the worst
case this may involve reading all control pages up to the root page. The
procedure ensures sequential access to the database records in the order
of key fields.
Skipping to the previous record involves basically the same operations,
but the position of the previous record on a page must always be
calculated by scanning from the beginning of the page. The reason is that
the position of the next record can be computed from the size of the
current record, whereas there is no data from which to compute the size of
the previous one. However, both forward and backward skips are carried out
in memory and take about the same time.
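The two skip operations can be sketched as follows (illustrative helpers
over the page layout described earlier; the names are hypothetical):

```python
def next_offset(page: bytes, offset: int) -> int:
    """Offset of the record after the one at `offset`.

    May equal the end of the used area, which signals that the next
    information page must be located via the control pages.
    """
    return offset + 2 + page[offset] + page[offset + 1]

def prev_offset(page: bytes, offset: int):
    """Offset of the record before the one at `offset`.

    There is no backward pointer in a record, so we rescan from the
    start of the page; this runs in memory and is still fast.
    """
    pos, prev = 0, None
    while pos < offset:
        prev = pos
        pos = next_offset(page, pos)
    return prev  # None if `offset` is the first record on the page
```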

Deleting records
Deleting the current record means moving all the following records on the
page to the left by the size of the current record and writing the changed
page back to disk.

Quick move to the next record
Moving to the next record is a frequent operation when scanning the whole
database. With the forward-skip method described above, each time an end
of page is encountered at least one control page has to be read before the
next information page is reached, which approximately doubles the time of
scanning the database. To speed this up, a quick-skip method is provided:
it uses a direct reference to the next information page stored at the end
of each page. This makes it possible to scan the database reading only the
information pages and skipping the control pages, which store no user
data.

The limitations of the model

The described optimizations offer considerable gains in access time and
disk space usage but also impose limitations on the applications that
implement this approach.

1. Limitation on the sizes of the key and information fields. Because each
field size is represented by one byte, a field cannot exceed 255 bytes.
Within the framework of the linguistic model this has the following
consequences:
Stem dictionary
It is evident that the pseudostem of a word cannot be longer than 255
characters. Taking into account that morphology information in the
Multitran system occupies 6 bytes per pseudostem, each record can
reference up to 36 morphological blocks. This is quite enough for any
language; in practice the maximum number of information blocks does not
exceed 16.

Translation index
The key field of a translation index record stores the word numbers of a
phrase. With the word number size of 3 bytes used, the maximum number of
words in a phrase is 85, whereas there are hardly any stable phrases
containing more than 10 words. The information field contains the term
numbers corresponding to the current word or phrase, so each index record
can reference up to 85 translation entries. Bearing in mind that this
number of translations corresponds to a single part of speech, each word
can have 85 translations as a noun, 85 as an adjective, and so on. This is
enough for most cases.

Translation dictionary
The key field contains the term number and is always 3 bytes long. The
information field stores the translation entry proper, which has an
arbitrary size. If lengthy comments are present, it can in fact exceed 255
characters; such long entries have to be split into parts and stored as
separate records with the same term number.

Thus the record-size limitation is not critical for any part of the
system.

2. 	Index size limitation
The considered B-tree model limits the maximum database file size, not the
number of records. The actual number of records depends upon record sizes
and overflow handling; the only parameter that sets the limit is the
number of pages. With the currently used 1-Kb page model the maximum file
size is 65 Mb. This is enough to store about 3 million translation
dictionary records, which are the longest ones. Stem dictionaries and
translation indexes could hardly reach such sizes.
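As a back-of-the-envelope check (the 16-bit page number and the 20-byte
average record size below are assumptions made for illustration, not
figures given in the text):

```python
PAGE_SIZE = 1024       # the 1-Kb page model
MAX_PAGES = 2 ** 16    # assumption: pages addressed by a 16-bit number

# 67,108,864 bytes = 65,536 Kb, i.e. the "65 Mb" limit
max_file_size = PAGE_SIZE * MAX_PAGES

AVG_RECORD = 20        # hypothetical average translation record size, bytes
# at an average page fill of about 75%, roughly 2.5 million records fit
records = max_file_size * 3 // 4 // AVG_RECORD
```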

3. 	Search speed limitation
B-tree search time grows in stages with the height of the tree and thus
depends on the number of records, but this dependence is by no means
proportional. On average, for databases with 10,000 records each search
operation takes one disk access. As the number of records grows further,
the search time increases to two disk accesses and then remains stable
over several further orders of magnitude of growth.

4. 	Data redundancy and file space usage
Despite the optimizations, the database stores internal system information
to maintain the interrelation of the data; this information occupies two
bytes per record. According to B-tree theory, each page is at least half
filled with data; in the optimal case a page can be 100% full, and the
average page is approximately 75% full.

5. 	Deleting records
In theory, deletion could remove the last record from a page, leaving the
page empty. When such a situation is encountered, it is reasonable to
delete the corresponding record from the control page and place the freed
page on a free-page list for subsequent reuse. Successive removals could
then decrease the tree height. However, computer dictionary systems append
records far more frequently than they delete them. For this reason this
mode of data removal was not fully implemented, and the last record cannot
be deleted from a page.

6. 	Building the index
Building an index from an array of unsorted data is a problem in its own
right. When it arises, indexing is implemented by successive appending of
records to the index file.

It is evident that the proposed B-tree model is not flawless, but its
limitations do not significantly affect its usefulness.

Further improvements

The following steps could further increase the effectiveness of the model:

1. 	Increasing the page size. This increases the maximum possible file
size and decreases the tree height. As mentioned above, the height of the
B-tree remains constant over a wide range of record counts and is about
the only parameter affecting search speed. For each database type it is
possible to select a page size that corresponds to the average number of
records in the database and keeps the tree height as low as possible.
2. 	Data compression on the page, which could also reduce search time and
database size.
3. 	Further optimization of data on the control pages. The actual data in
the index are stored only at the lowest level of pages, and the tree
height increases when the control pages overflow, because the information
pages always remain at the lowest level. By optimizing data storage on the
control pages it is possible to make the B-tree more ramified and
slower-growing. It is possible, for example, to discard the byte
describing the size of the information field in each control-page record,
because that field stores the number of the corresponding lower-level page
and is always three bytes long. Considering the small size of control-page
records, this change could make the database at least 20% more effective
with respect to the number of records it can store without overflowing.

Currently the Multitran dictionary stores more than 1 million records in
databases created using the described principles and allows quick search,
appending and modification of linguistic information. The total size of
all the databases does not exceed 20 Mb.