InterSign Workshop in Hamburg July 17th to 19th, 1998

Aspects of Lexical Databases: Background Paper

In this workshop we are not dealing with large language corpora in general, but specifically with "Lexical Databases", whose purpose is to register and document a range of vocabulary and to serve as an inventory for a language and its dialects, as a source for learning or teaching sign language, as a reference base for analyzing and interpreting sign language sequences, or as a basis for statistical analysis.

Rationale

We propose to start with an inductive approach: to look at each other's lexical databases and then decide what may be relevant to discuss.

Some questions, however, we will probably have to discuss immediately after the different approaches have been demonstrated and the structure of the different lexical databases has been analysed, and before we enter into a discussion of individual phenomena of incompatibility. These questions are:


A comparison with spoken language databases might be helpful. Those have the following purposes (can we adopt the same functionalities?):

(see the extensive documentation about lexical databases, lexica, and dictionaries from the EAGLES project)

Some of these functions we probably do not (yet) have in mind, but maybe we will one day. The consequences of these functions are obvious:

Similar thoughts apply when we glance at the structure of spoken language databases (from the same source):

What is the content of spoken language databases normally? (from the same source):

How would we define the content of sign language lexical databases? What do we count as lexical units (stems, prefixes, compound signs, merged signs, inflected signs, other types of sign formation, idioms)? How do we define lemmata? Etc.

Description of the problem

Several sign language databases exist in Europe. They have reached different stages of development, they represent different approaches to sign language research, and they aim at different purposes (dictionary, research data, teaching material).

The problem is not so much that different amounts of information are registered, i.e. that some databases register more types of data and others less; it is not even that entirely different information is registered. The problem is that the information which is registered is probably registered in different ways.

The problem is one of missing standards for naming variables, unknown standards of data entry, and competing and conflicting methods of data coding; in short, a lack of compatibility.

The advantages of having compatible lexical databases are that

Range of problems

What is the range of problems encountered if we think in terms of the compatibility of the data in different databases? What are the differences between different databases which we would like to compare or make compatible?

Structure of the lexical database

The compatibility problem that is probably easiest to overcome is the different record structures and field definitions of different databases. Once a structure has been published and transparently defined, the data needed can be converted to another database with a different record structure.
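As a sketch (with invented field and database names), such a conversion is little more than renaming fields once both record structures are known:

```python
# Hypothetical example: converting a record from "database A" to the
# field layout of "database B". All field names here are invented.

# A record as exported from database A
record_a = {"GLOSS": "HOUSE", "LAND": "DE", "NOTATION": "flat hand at chin"}

# Published mapping from database A's field names to database B's
FIELD_MAP_A_TO_B = {
    "GLOSS": "gloss",
    "LAND": "country",
    "NOTATION": "transcription",
}

def convert_record(record, field_map):
    """Rename the fields of one record according to a published field map."""
    return {field_map[key]: value for key, value in record.items()}

record_b = convert_record(record_a, FIELD_MAP_A_TO_B)
```

The hard part, of course, is not the renaming but agreeing on (and publishing) the field map in the first place.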

Keys or tags for variables

The compatibility problem starts with the names and abbreviations used for the keys or tags in a database (lexical key or identifier), e.g. the name of the country tag, the abbreviation for the country code, the name of the language registered, the abbreviation (acronym) of the sign language name, the identifiers for the sign itself, the notation of handshapes and movements, etc.

Thomas Hanke has written a paper for the Signing Books project which illustrates how complex this problem is. He demonstrates it for the apparently trivial aspect of country names and language names. The paper is available at the following address on our server:

He also points out that ISO norms exist for country and language codes, and his paper contains links to the WWW addresses where these norms are listed.
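For illustration, a lookup keyed by ISO 3166-1 alpha-2 country codes might look as follows; the sign language acronyms (DGS, BSL, NGT) are the ones commonly used in the literature and are not themselves ISO-normed:

```python
# Sketch: using ISO codes rather than ad-hoc names as database keys.

# ISO 3166-1 alpha-2 country codes
ISO_COUNTRY = {"Germany": "DE", "United Kingdom": "GB", "Netherlands": "NL"}

# Commonly used sign language acronyms, keyed by ISO country code
SIGN_LANGUAGE = {"DE": "DGS", "GB": "BSL", "NL": "NGT"}

def sign_language_for(country_name):
    """Look up the sign language acronym via the ISO country code."""
    return SIGN_LANGUAGE[ISO_COUNTRY[country_name]]
```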

Data entries

The problem of incompatibilities is continued for the data entries. To give some examples:

Some databases may be only verbal, registering names and glosses; others contain transcriptions as well. Thus the range of information covered differs. This would not be very problematic if the kind and type of information did not also differ, which is the case when

Glossing

Most glossing conventions use special characters, a mixture of alphabetical and graphical symbols.

We have to transform this type of code into 7-bit ASCII for the computer. The task of converting glossing into a common 7-bit notation is in principle possible and may find a not-too-difficult solution, whereas the task of converting between different transcription notation systems cannot be solved completely.
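A minimal sketch of such a conversion, assuming hypothetical substitution conventions (the actual character mappings would have to be agreed upon):

```python
# Hypothetical conventions: map special glossing characters to 7-bit
# ASCII substitutes so that glosses survive any transfer channel.

ASCII_MAP = {
    "Ü": "UE", "Ä": "AE", "Ö": "OE", "ß": "SS",  # German umlauts and eszett
    "↑": "^",                                    # e.g. an upward-movement marker
}

def to_ascii(gloss):
    """Replace every special character by its agreed 7-bit substitute."""
    return "".join(ASCII_MAP.get(ch, ch) for ch in gloss)
```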

Notation

It could well be that transcriptions of sign language, even if they follow the same notation method, are entered in different fields or as strings which are segmented (according to handshapes, movements, references) and cannot easily be reassembled. But this problem is easier to solve than the next one.

Converting Stokoe to HamNoSys (or HamNoSys to SignPhon) and converting HamNoSys back to Stokoe (or SignPhon back to HamNoSys) leads to inconsistencies and/or to information loss (the "round-trip fidelity problem"). When you convert one notation (e.g. Stokoe or SignPhon) into another (e.g. HamNoSys) and the first has fewer symbols for a certain phenomenon than the second, you have too many choices for converting one symbol into the other; if you reverse the procedure and want to regain the first notation from the second, this results either in an incorrect solution or in information loss, depending on how you handle it. The consequences of this irreversibility of the conversion are what the term "round-trip fidelity problem" refers to.
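The asymmetry can be sketched with two invented symbol inventories, one coarser than the other (the symbol names below do not belong to any real notation system):

```python
# Sketch of the round-trip fidelity problem. The "fine" notation
# distinguishes two flat-hand variants; the "coarse" notation has only one
# symbol "B" for both, so coarse -> fine must fix one choice by default.

FINE_TO_COARSE = {"flat_spread": "B", "flat_closed": "B"}  # many-to-one
COARSE_TO_FINE = {"B": "flat_closed"}                      # default choice

def round_trip(fine_symbol):
    """fine -> coarse -> fine; the distinction may be lost on the way."""
    return COARSE_TO_FINE[FINE_TO_COARSE[fine_symbol]]
```

Here "flat_closed" survives the round trip but "flat_spread" does not: the conversion is not faithful, and no choice of default can make it so.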

Simple Example: From Stokoe to HamNoSys (handshape)

Difficult Example: From HamNoSys to SignPhon (shoulder)

Stokoe to HamNoSys and back to Stokoe:

Semantic coding

Semantic coding of language data is achieved in numerous, mostly incompatible ways. One well-known procedure is to construct semantic nets.
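As a minimal illustration (with invented relation names and entries), a semantic net can be represented as a labelled graph; real coding schemes structure and label such nets differently, which is exactly the source of incompatibility noted above:

```python
# Hypothetical toy semantic net: edges are (gloss, relation) -> gloss.

SEMANTIC_NET = {
    ("HOUSE", "hypernym"): "BUILDING",
    ("HOME", "synonym"): "HOUSE",
}

def related(gloss, relation):
    """Follow one labelled edge in the net, if present."""
    return SEMANTIC_NET.get((gloss, relation))
```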

Homonymy - Synonymy

Morphology - Reduplication

Strategies of Solution

Compatibility by standardization

One strategy could be to enter into an agreement upon a common format for new databases and to agree to modify one's own existing database accordingly.

Advantages

This strategy (a politician's dream) would solve the problem of compatibility once and for all, both transnationally and for the owner of each individual database.

Disadvantages

This is the general problem with introducing a standard for data collections when substantial amounts of data have already been collected:

The task of making one's database compatible falls to each owner of a database, and the amount of work needed for each partner to adapt his or her own database to the compatibility criteria could be tremendous.

Such a standard will lack acceptance if, for any group, it turns out to be less powerful than what was used before. The consequence might be a "one fits all" standard, with the danger that each group has to enter far more data than they feel is necessary.

Compatibility by conversion

Another strategy could consist in agreeing upon a common format as a kind of "database interlingua" and writing programs which could be used for automatically converting national databases into a common European format. In its simplest form, the database interlingua could consist of nothing more than a conversion table with reference numbers, with no claim to standardize the items registered.
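A minimal sketch of such a conversion table; the reference numbers are invented, and the symbol names stand in for entries of the respective notation systems without claiming to be accurate:

```python
# Hypothetical "database interlingua": each coding maps its symbols to
# shared reference numbers and back. Nothing about the codings themselves
# is standardized; only the table is shared.

TO_INTERLINGUA = {
    "stokoe": {"B": 101, "A": 102},
    "hamnosys": {"hamflathand": 101, "hamfist": 102},
}
FROM_INTERLINGUA = {
    "stokoe": {101: "B", 102: "A"},
    "hamnosys": {101: "hamflathand", 102: "hamfist"},
}

def convert(symbol, source, target):
    """Convert a symbol between codings via the shared reference number."""
    return FROM_INTERLINGUA[target][TO_INTERLINGUA[source][symbol]]
```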

The easiest way to go would be a least common denominator approach. This, however, might turn out to be of little value if some of the coding conventions to be considered are mutually incompatible (cf. round-trip fidelity problem). Furthermore, a solid definition of each of the coding conventions used so far would be necessary in order to define the transformation rules.

The resulting standard would probably not be acceptable for new projects, since it is not powerful enough. So it might be more like a pidgin than a lingua franca.

Advantages

The task for each owner of a database is easier, and the workload much lower; once the common format has been agreed upon, each partner only has to define the deviations in his or her database, not modify the existing data entries.

Disadvantages

The task of writing a compatibility program for converting existing databases is quite difficult, and the workload rests on the shoulders of only one partner; probably none of the partners has the funds to take over responsibility for this task.

And achieving compatibility has certain unpleasant consequences: either a loss of information or a number of incorrect solutions.

Standardization and Conversion

How can the two approaches be combined, i.e. is there a coding convention that is powerful enough to be used by each participant for new projects AND to serve as an interlingua? One idea might be to define a coding scheme as a granularity pyramid, i.e. a stack of mutually compatible coding schemes that differ in detailedness. Any actual coding consists of a detailedness specifier and the actual data. A transformation yields a coding of the highest detailedness that can be achieved given the source data. If two codings from different levels are to be compared, the coding with the higher detailedness is reduced to the level of the other.
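The reduction step can be sketched as follows; the levels, symbols, and reduction rules are invented examples, not an actual proposal:

```python
# Sketch of the granularity pyramid: every coding carries a detailedness
# level, and a more detailed coding can always be collapsed to a coarser
# level for comparison.

# Reduction rules: how a symbol at level n collapses at level n - 1
REDUCE = {
    2: {"flat_spread": "flat", "flat_closed": "flat", "fist": "fist"},
}

def reduce_to(symbol, level, target_level):
    """Step a coding down the pyramid until it reaches target_level."""
    while level > target_level:
        symbol = REDUCE[level][symbol]
        level -= 1
    return symbol

def comparable(coding_a, coding_b):
    """Compare two (symbol, level) codings at the lower of their levels."""
    (sym_a, lvl_a), (sym_b, lvl_b) = coding_a, coding_b
    common = min(lvl_a, lvl_b)
    return reduce_to(sym_a, lvl_a, common) == reduce_to(sym_b, lvl_b, common)
```

Unlike the round-trip conversion above, the reduction only ever goes downward, so it is lossy but never incorrect.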

Another strategy - especially for multilingual databases - might be to resist the temptation to homogenize the entries and work with plain HTML linking instead.

Criteria for decision

ISO norms

See the paper by Thomas Hanke on country names and language names. It could well be one of our objectives to identify areas in which ISO norms exist for similar phenomena; it is, however, most probably not our aim (and not within our competence) to start defining new ISO norms.

Standardization below ISO

It may well be our aim to standardize data registry from a linguistic viewpoint in order to reach the same level of standardization for sign languages that exists for spoken languages (e.g. IPA).

Pragmatical reasons for conventions

Even if we did not intend to reach a level of standardization, it should well be our aim to define conventions enabling us to achieve certain pragmatic goals, for instance to initiate comparative linguistic research (e.g. contrastive grammar, sign language statistics) or to develop multilingual applications for teaching and learning sign language.


Thomas.Hanke (at) sign-lang.uni-hamburg.de - Rolf.Schulmeister (at) sign-lang.uni-hamburg.de

