Database model for storing linguistic information

A. Pominov, firstname.lastname@example.org
Researcher for Multitran Automatic Dictionaries

Abstract

An approach to building a specialized database is considered, based on the model implemented in the Multitran Automatic Dictionary. The design centers on a highly ramified, optimized B-tree with variable record sizes. The model provides fast access, appending and modification of data and was developed specifically for storing linguistic information in computer dictionaries.

The storage of information in linguistic systems (machine translation, computer dictionaries, spell checkers and hypertext systems) places a number of special requirements on the database model compared to other applications. Processing of text data requires the following features of the information system:

- the capability to handle large numbers of records (from hundreds of thousands to several million entries);
- native handling of variable record sizes (e.g., Russian words range from 1 to more than 30 letters);
- high speed of operation on large amounts of data;
- the ability to modify or delete existing records and to append new ones.

Standard database management packages such as Fox, Clipper, Paradox and others are ill-suited to these tasks for the following reasons:

- The fixed record size in the relational models of the available packages significantly increases database size, which adversely affects the speed of basic operations. Splitting oversize entries into several records would still waste space and require additional processing.
- Databases and indexes on key fields are stored as separate files. This creates data redundancy and additional losses of file space and search time.
- Data processing is slow for large amounts of data.
Generally, standard DBMS systems operate efficiently for databases containing several hundred thousand entries; beyond that, index access time grows in proportion to database size. These limitations create the need for a specialized database that meets the requirements of text processing. The proposed model is based on the widely used B-tree structure, optimized to overcome the problems listed above.

The basic principle underlying the approach is the observation that the speed of data processing is inversely proportional to the size of the data. This relationship can be exploited to improve one parameter of the system, such as speed, by minimizing the physical database size. By excluding data redundancy and eliminating wasted record space on the one hand, and applying methods of quick data access on the other, acceptable results can be obtained for building practical linguistic systems. The optimizations implemented in the structure of the Multitran computer dictionary can be considered in two stages.

1. Methods of reducing database size

First of all, each type of data in the system is represented as a single file whose records contain two fields: a key field and an information field. Thus each database contains both the useful information and the means of indexed access at the same time. The key field is used for indexed search and contains the search value corresponding to the given record. The information field contains the data associated with the key field and is retrieved when a key search succeeds. There can be only one key field per database, but the description below shows that this is sufficient for every type of data used in the automatic dictionary.

The data types used in the automatic dictionary are the stem dictionary, the translation index and the translation dictionary. The stem dictionary stores stems (or pseudostems) of words in the key field and morphology data in the information field.
This data is used for morphological analysis and synthesis. The translation index contains word number(s) in the key field and corresponding term number(s) in the information field. The translation dictionary stores a term number in the key field and the translation entry proper in the information field. Using a floating record size in each of these cases significantly reduces the database size.

The other design decision is the maximum record size. For the task of storing linguistic data, the upper limit is set to 255 bytes for the key field and 255 bytes for the information field. The size of each record therefore varies from 0 to 510 bytes.

2. Methods of optimizing data access in the B-tree

A B-tree consists of several levels of pages; a search starts at the single top-level page and continues down until the lowest level is reached. Data are stored so as to allow sorted access. In the considered model the actual data are stored only at the lowest level of pages, while the upper-level pages contain the reference data needed to perform ordered searches. The upper levels contain control pages and the lowest level holds information pages. The control page at which every search begins is called the root page. Each database consists of several levels of control pages and a single level of information pages.

Each page contains records whose first two bytes store the sizes of the key and information fields, followed by the values of those fields. Key fields are kept in increasing order throughout each page and the whole index, while information fields may contain arbitrary data. Thus the size of each field varies from 0 to 255 bytes, and a page is filled without loss of space until it overflows during appending.
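The record layout above can be sketched in a few lines of Python (the helper names are illustrative; the actual Multitran implementation is not given in the paper):

```python
# Minimal sketch of the described record layout: one size byte per field,
# followed by the key and information bytes. Helper names are illustrative.

def pack_record(key: bytes, info: bytes) -> bytes:
    """Serialize a record as [key size][info size][key][info]."""
    assert len(key) <= 255 and len(info) <= 255
    return bytes([len(key), len(info)]) + key + info

def iter_records(page: bytes):
    """Walk a page of packed records, yielding (key, info) pairs."""
    pos = 0
    while pos < len(page):
        klen, ilen = page[pos], page[pos + 1]
        pos += 2
        yield page[pos:pos + klen], page[pos + klen:pos + klen + ilen]
        pos += klen + ilen

page = pack_record(b"medium", b"\x01\x02") + pack_record(b"meet", b"\x03")
print([k for k, _ in iter_records(page)])   # [b'medium', b'meet']
```

Because each field carries its own size byte, the page holds records back to back with no padding, which is exactly why no space is lost until the page overflows.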
Each key field on a control page contains the minimal differing substring between the last record of one information page and the first record of the information page that follows it in key order. Taking the English stem dictionary as an example: if the last record of an information page contains 'medium' in its key field, and the first record of the next information page contains 'meet', the control-level record corresponding to the first information page contains 'mee' in its key field and the offset (or number) of that page in its information field.

[Figure: the control level holds the record mee -> 1; at the information level, page 1 ends with the record 'medium' and page 2 begins with the record 'meet'.]

Data search

The search starts from the beginning of the root page and stops at the first record whose key field is equal to or greater than the search string. The procedure continues on the page specified in the information field of the found record, until an information page is reached. The result of the search on that information page is the overall result of the seek operation: the search string is either equal to, smaller than or greater than the found record. If several identical records exist, the search stops at the first of them.

Appending records to the index

Appending a new record starts from the current database position, established by the search operation performed for the new record. If several coinciding records exist, the new record is appended after the last of them. If there is enough free space on the page, the new record is inserted in key-field order. Because page overflow is relatively rare, the basic operations needed for appending are the record search, modification of the page contents and writing of the page back to disk. If there is not enough space on the page for the new record, the overflow procedure is performed.
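The formation of the 'mee' delimiter from 'medium' and 'meet' can be sketched as follows (the function name is illustrative): the delimiter is the shortest prefix of the next page's first key that still compares greater than the previous page's last key.

```python
# Sketch of forming the minimal differing substring (delimiter) between
# two adjacent information pages. The function name is illustrative.

def shortest_delimiter(last_key: str, first_key: str) -> str:
    """Shortest prefix of first_key that compares greater than last_key."""
    for n in range(1, len(first_key) + 1):
        if first_key[:n] > last_key:
            return first_key[:n]
    return first_key

print(shortest_delimiter("medium", "meet"))   # mee
```

Any such prefix is greater than every key on the preceding page and not greater than any key on the following one, so it works as a separator while staying as short as possible.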
The standard practice with B-trees is to split an overflowing page into two portions of equal size, so that the tree grows in a balanced way. On the one hand, with floating record sizes it is not possible to split a page into two exactly equal portions; on the other hand, the overflow stage offers an opportunity for optimizations that improve overall database performance.

The overflow procedure that splits a page produces a delimiter, which is appended to the upper level. The size of this delimiter directly affects how the B-tree grows: the shorter the delimiter, the slower the tree grows as new entries are appended. It is therefore effective to form the shortest possible delimiter. The simplest way is to use the shortest differing portion of two successive records. Moreover, to choose the split position it is possible to scan a number of records in the vicinity of the page center and pick the pair with the shortest differing portion. The delimiter formed in this way is appended to the corresponding upper-level control page.

The first portion of the overflowing page remains in its place in the file, while the second is written to the end of the file and forms a new page. The delimiter is inserted into the control page so as to maintain the sorted order of key fields. Appending to the upper page can in turn cause an overflow, which is handled in the same manner; the only difference is that control-page records are shorter and therefore form shorter delimiters. If appending causes the root page to overflow, a new root page is formed and the height of the tree increases. This also increases the time of subsequent seek operations, because search speed is inversely proportional to tree height. Once the overflowing page has been split, the new record that caused the overflow can be appended.
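The scan around the page center can be sketched like this (the function names and the window size are illustrative assumptions, not taken from the paper):

```python
# Sketch of choosing the split point: examine a few record boundaries
# around the page center and pick the one yielding the shortest delimiter.

def shortest_delimiter(last_key: str, first_key: str) -> str:
    """Shortest prefix of first_key that compares greater than last_key."""
    for n in range(1, len(first_key) + 1):
        if first_key[:n] > last_key:
            return first_key[:n]
    return first_key

def best_split(keys, window=4):
    """Return (split index, delimiter) with the shortest delimiter
    among boundaries within `window` records of the page center."""
    mid = len(keys) // 2
    best = None
    for i in range(max(1, mid - window), min(len(keys) - 1, mid + window) + 1):
        d = shortest_delimiter(keys[i - 1], keys[i])
        if best is None or len(d) < len(best[1]):
            best = (i, d)
    return best

keys = ["mediator", "medical", "medium", "meet", "meeting", "melon", "member"]
print(best_split(keys))   # (3, 'mee'): split between 'medium' and 'meet'
```

Trading an exactly central split for a nearby boundary with a shorter delimiter keeps the control pages compact, which is what slows the growth of the tree.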
As a result of the split, two pages are formed, each approximately half empty, so regardless of the split point the new record fits into one of them.

Skipping to the next/previous record

Skipping to the next record means adding the current record size to the current record offset. If this operation brings the record pointer to the end of the page, we read the corresponding control page, move to its next record and read the corresponding information page. In the worst case this means reading control pages all the way up to the root. This process provides sequential access to the database records in key-field order. Skipping to the previous record involves essentially the same operations, but the position of the previous record on the page must always be calculated from the beginning of the page: the position of the next record can be derived from the size of the current record, whereas there is no data from which to derive the size of the previous one. Since these calculations are carried out in memory, forward and backward skips nevertheless take about the same time.

Deleting records

Deleting the current record means moving all the following records on the page to the left by the size of the current record and writing the changed page back to disk.

Quick move to the next record

Moving to the next record is a frequent operation when scanning the whole database. With the forward-skip method described above, every time an end of page is encountered at least one control page must be read before the next information page is reached, which approximately doubles the time of scanning the database. To speed this up, a quick-skip method is provided: a direct reference to the next information page is stored at the end of each information page. This allows the database to be scanned by reading only the information pages and skipping the control pages, which store no user data.
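The in-page skip and delete operations above can be sketched directly from the record layout (helper names are illustrative):

```python
# Sketch of in-page navigation: the next record starts at the current
# offset plus the current record's total size; deletion shifts the tail
# of the page to the left. Helper names are illustrative.

def next_offset(page: bytes, pos: int) -> int:
    """Offset of the record after the one at `pos` (may equal len(page))."""
    return pos + 2 + page[pos] + page[pos + 1]

def delete_record(page: bytes, pos: int) -> bytes:
    """Remove the record at `pos` by shifting the following records left."""
    return page[:pos] + page[next_offset(page, pos):]

page = bytes([6, 0]) + b"medium" + bytes([4, 0]) + b"meet"
pos = next_offset(page, 0)          # forward skip: 0 -> 8
page = delete_record(page, 0)       # drop 'medium'; 'meet' shifts left
print(pos, page)                    # 8 b'\x04\x00meet'
```

Note that `next_offset` only moves forward: finding the record before `pos` requires rescanning from offset 0, exactly as the text explains.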
The limitations of the model

The described optimizations offer gains in access time and disk space but also impose limitations on the applications that implement this approach.

1. Limitation on the sizes of key and information fields. Because each field size is represented by one byte, a field cannot exceed 255 bytes. Within the framework of the linguistic model this has the following consequences:

Stem dictionary. A pseudostem of a word cannot be longer than 255 characters. Since the morphology information in the Multitran system occupies 6 bytes per pseudostem, each record can reference 36 morphological blocks. This is quite enough for any language; in practice the maximum number of information blocks does not exceed 16.

Translation index. The key field of a translation index record stores the word numbers of a phrase. With the word number size of 3 bytes, a phrase can contain up to 85 words, whereas there are hardly any stable phrases of more than 10 words. The information field contains the term numbers corresponding to the current word or phrase, so each index record can reference up to 85 translation entries. Since this count applies per part of speech, a word can have 85 translations as a noun, another 85 as an adjective, and so on, which is enough for most cases.

Translation dictionary. The key field contains the term number and is always 3 bytes long. The information field stores the translation entry proper, which has an arbitrary size; if lengthy comments are present, it can exceed 255 characters. Long entries must therefore be split into parts and stored as separate records with the same term number.

Thus the limitation on record size is not very critical for any part of the system.

2.
Index size limitation

For the considered B-tree model there is a limit on the maximum database file size, but not on the number of records. The actual number of records depends on record sizes and overflow handling; the only parameter that sets the limit is the size of the page number. With the currently used 1-Kb page model, the maximum file size is 65 Mb. This is enough to store about 3 million translation dictionary records, which are the longest; stem dictionaries and translation indexes could hardly reach such sizes.

3. Search speed limitation

B-tree search time increases in steps as the tree height grows, and thus depends on the number of records, but this dependence is by no means proportional. On average, for databases of 10,000 records each search operation takes one disk access. As the number of records grows further, the search time increases to two disk accesses and then remains stable for the next million records.

4. Data redundancy and file space usage

Despite the optimizations, the database stores internal system information to maintain data interrelations. This information occupies two bytes per record. According to B-tree theory, each page is at least half filled with data; in the optimized case it can be 100% full, and the average page is approximately 75% full.

5. Deleting records

In theory, deleting records could remove the last record from a page, leaving it empty. When this happens it is reasonable to delete the corresponding record from the control page and place the freed page on a free-page list for subsequent reuse. Successive removals could then decrease the tree height. However, computer dictionary systems append far more often than they delete, so full data removal was not implemented, and the last record cannot be deleted from a page.
Building of the index

Building the index from an array of unsorted data is a problem in its own right. When it arises, indexing is implemented by successively appending records to the index file.

It is evident that the proposed B-tree model is not flawless, but its limitations do not significantly affect its usefulness.

Further improvements

The following steps could further increase the effectiveness of the model:

1. Increasing the page size. This increases the maximum possible file size and decreases the tree height. As mentioned above, the height of the B-tree remains constant over a wide range of record counts and is about the only parameter affecting search speed. For each database type it is possible to select a page size that corresponds to the average number of records in that database and keeps the tree height as low as possible.

2. Data compression on the page, which could reduce both search time and database size.

3. Further optimization of the data on the control pages. The actual data in the index are stored only at the lowest level of pages, and the tree height increases only when control pages overflow, because the information pages always remain at the lowest level. By optimizing data storage on the control pages it is possible to make the B-tree more ramified and more slowly growing. For example, the byte describing the size of the information field could be discarded from each control-page record, because that field stores the number of the corresponding lower-level page and is always three bytes long. Given the small size of control-page records, this change could make the database at least 20% more effective with respect to the number of records it can store without overflowing.
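The third improvement can be illustrated with a one-byte-per-record saving (the function name and the byte order of the page number are assumptions for the sake of the sketch):

```python
# Sketch of the proposed control-page record layout: the information field
# is always a 3-byte page number, so its size byte can be dropped. The
# function name and the little-endian byte order are illustrative choices.

def pack_control_record(delimiter: bytes, page_no: int) -> bytes:
    """Serialize as [key size][delimiter][3-byte page number]."""
    assert len(delimiter) <= 255 and 0 <= page_no < 2 ** 24
    return bytes([len(delimiter)]) + delimiter + page_no.to_bytes(3, "little")

rec = pack_control_record(b"mee", 1)
print(len(rec))   # 7 bytes, versus 8 with an information-size byte
```

With delimiters of only a few bytes, saving one byte per control record noticeably raises the number of delimiters a control page can hold before overflowing.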
Currently the Multitran dictionary stores more than a million records in all the databases built on the described principles, and allows quick search, appending and modification of linguistic information. The total size of all databases does not exceed 20 Mb.