The Sign Language Dataset Compendium


About

The information provided in the compendium is compiled from public resource documentation, research articles, inspection of public data and personal correspondence with resource creators. Each compendium entry consists of a free-form text description, a structured info table and a list of references.

As we follow the terminology of each individual resource, differences in terminology, such as different size indications (sign, token, type) or the use of deaf vs. Deaf, may occur. Where possible, we use consistent terminology, enriched with comments if needed. All entries are interconnected, providing links between related resources, between languages and resources, and between tasks and corpora. Resources can be filtered using keywords.

This page outlines the curation criteria that resources must meet to be considered for inclusion, followed by descriptions of how different entry types of the compendium are structured.

Curation Criteria

The goal of the compendium is to help researchers find data that represent each language as it is used naturally by signers with L1 language proficiency. Corpora should contain (semi-)spontaneous language production rather than prepared utterances or translations of spoken language content. As such, the compendium does not cover interpreted television broadcasts or language acquisition datasets.

In selecting suitable curation criteria, we also had to take into account that there are strong imbalances between languages in the size and number of available resources. To address this, we chose a two-tiered approach of minimum and strict requirements. All resources must meet the minimum requirements, but if some resources for a given language also meet the strict requirements, other resources for that language are not (yet) listed. The requirements are applied to corpora and lexical resources separately, so a language can be subject to strict requirements for one and minimum requirements for the other. This regulates the number of included resources for comparatively well-resourced languages without disqualifying less-resourced languages entirely.

Developing the criteria was an organic process that went hand in hand with the inspection of potential resources. They may be adjusted further as the compendium grows over time.

The curation criteria for the compendium are as follows:

General Criteria for Resources

  1. Must include video data
  2. No sign-supported systems
  3. No language acquisition data
  4. No historical sign languages
  5. Data must be attainable

Criteria for Corpora

  1. Must be (semi-)spontaneous signing
  2. L1 signers
  3. Data must have at least a partial translation and/or gloss annotation
  4. At least 5 hours (minimum) or 10 hours (strict) of sign language recordings. (Multilingual corpora are exempt.)

Criteria for Lexical Resources

  1. Must include index
  2. At least 100 (minimum) or 1000 (strict) different signs. (Multilingual resources are exempt.)
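
The two-tiered size rule can be illustrated with a short Python sketch. It is not the compendium's actual tooling: the `Resource` record, its field names and the `select` function are hypothetical, and it assumes that exempt multilingual resources are always kept and do not influence which tier applies to a language. Only the thresholds are taken from the criteria lists above.

```python
from dataclasses import dataclass

# (minimum, strict) thresholds from the criteria lists above.
THRESHOLDS = {
    "corpus": (5, 10),        # hours of sign language recordings
    "lexical": (100, 1000),   # number of different signs
}


@dataclass
class Resource:
    """Hypothetical record; not the compendium's own schema."""
    name: str
    language: str
    kind: str                 # "corpus" or "lexical"
    size: float               # hours for corpora, sign count for lexical resources
    multilingual: bool = False


def select(resources):
    """Apply the two-tiered size rule per (language, kind) group."""
    # Assumption: exempt resources are always kept and do not set the tier.
    selected = [r for r in resources if r.multilingual]
    groups = {}
    for r in resources:
        if not r.multilingual:
            groups.setdefault((r.language, r.kind), []).append(r)
    for (_language, kind), group in groups.items():
        minimum, strict = THRESHOLDS[kind]
        meets_strict = [r for r in group if r.size >= strict]
        meets_minimum = [r for r in group if r.size >= minimum]
        # If any resource meets the strict tier, only those are listed;
        # otherwise everything that meets the minimum tier is listed.
        selected.extend(meets_strict or meets_minimum)
    return selected
```

Because groups are keyed by both language and resource type, a language can end up under the strict tier for corpora and the minimum tier for lexical resources, as described above.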

Criteria for Data Collection Tasks

  1. Used by multiple resources

Corpus Entries

Each corpus entry consists of a brief description, followed by an information table. The Cite As section specifies how the resource itself should be cited according to its creators. If the corpus contains common data collection tasks, they are outlined in a series of short tables. The last elements are a list of articles mentioned in the entry and the date when the entry was last inspected.

Information Table for Corpus Entries
Language: The languages used in the primary data of the corpus. Does not include languages used in annotation or translation.
Size: Size of the corpus. Depending on the information available, this may be specified as token count, type count, recording hours, number of video clips and/or file size.
Participants: Demographic information about the corpus participants. Apart from the number of participants, this may include which regions they are from, age groups, gender distribution, and more. It is limited to demographic information that has been publicly documented.
Metadata Format: The file formats in which machine-readable metadata is provided by the corpus.
Translation: Which languages the primary data is translated into and how much of it has been translated.
Annotation: How much data has been annotated and which annotation conventions were used. If possible, a reference to the conventions is provided; otherwise the information is paraphrased.
Data Format: The file formats in which the annotation/translation data of the corpus is provided.
Licence: The licence conditions for using the dataset. These may be commonly used licences such as those by Creative Commons or custom licences defined for the dataset. A link to the licence is provided where possible.
Access: Describes how public and restricted data can be accessed. If the dataset has both public and restricted parts, this category identifies which parts of it are public. The terminology used for this category is to be understood as follows:
  1. Public access via browsable homepages: a portal for a non-scientific audience where the data is shown, in most cases as video only, with no annotation or with subtitles only
  2. Open access: one can look directly into the data via a homepage
  3. Restricted access: access is restricted, and no detailed information on ways of access is given
  4. Restricted access with registration: one has to register to look at the data; registration is handled automatically and takes effect within a very short time
  5. Restricted access requires confirmed registration: one has to register to look at the data; registration is handled manually and may ask for information on affiliation, plans for data usage or reasons for access
  6. Restricted access requires (individual) licence agreement: a contract between the data holder or owner and oneself has to be made
Webpages: A list of relevant websites, such as those for the project, the research dataset, or portals for access by the general public.
Institution: List of the universities or other organisations by which the dataset was created.
Publications: Important bibliographic references for the resource. If an external list of publications for the resource exists, a link to it is included here.

Table of Common Tasks Used in this Corpus
Task: The data collection task in question. Provides a link to the task entry.
# recordings – open access: The number of recordings that are available in the publicly accessible part of the corpus.
# recordings – restricted access: The number of recordings that are only available in the non-public part of the corpus.
Data available: Links to the corpus recordings of this task, where available. Where possible, these links will connect only to the given task; otherwise disambiguating notes are provided to help find the task on the referenced page.
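
For readers who want to mirror a corpus entry in their own scripts, the following Python sketch shows one possible in-memory representation. All class and field names (`Access`, `TaskRecordings`, `CorpusEntry`) are hypothetical and cover only a subset of the table fields; the compendium does not prescribe any such schema. The six access levels follow the terminology listed above, and a plain `Enum` is used so that no ordering of restrictiveness is implied.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Access(Enum):
    """The six access categories, numbered as in the list above."""
    PUBLIC_BROWSABLE = 1
    OPEN = 2
    RESTRICTED = 3
    RESTRICTED_REGISTRATION = 4
    RESTRICTED_CONFIRMED_REGISTRATION = 5
    RESTRICTED_LICENCE_AGREEMENT = 6


PUBLIC_LEVELS = {Access.PUBLIC_BROWSABLE, Access.OPEN}


def is_restricted(level: Access) -> bool:
    """True for every category whose name starts with 'Restricted access'."""
    return level not in PUBLIC_LEVELS


@dataclass
class TaskRecordings:
    """One row group of the 'common tasks' table (hypothetical field names)."""
    task: str
    open_access: int = 0
    restricted_access: int = 0
    links: List[str] = field(default_factory=list)


@dataclass
class CorpusEntry:
    """A subset of the corpus information table (hypothetical field names)."""
    name: str
    languages: List[str]
    size: str                    # free text: tokens, types, hours, clips, ...
    access: List[Access]         # a corpus may have public and restricted parts
    licence: str = ""
    tasks: List[TaskRecordings] = field(default_factory=list)

    def open_recordings(self) -> int:
        return sum(t.open_access for t in self.tasks)

    def restricted_recordings(self) -> int:
        return sum(t.restricted_access for t in self.tasks)
```

Keeping `size` as free text reflects that sizes are reported in whatever unit each resource documents, so values are not directly comparable across entries.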

Lexical Resource Entries

The collection of lexical resources includes both lexical databases and digital dictionaries. Each lexical resource entry consists of a brief description, followed by an information table. The Cite As section specifies how the resource itself should be cited according to its creators. The last elements are a list of articles mentioned in the entry and the date when the entry was last inspected.

Information Table for Lexical Resource Entries
Languages: The languages used in the lexical resource. As most lexical resources can be used as bilingual dictionaries to some degree, this covers both signed and spoken languages.
Size: Number of lexical items. Items are identified as signs or types depending on the resource.
Linguistic Information: Which linguistic information is provided for lexical items, such as ID-glosses, translational equivalents, citation form video, meanings, phonetic transcription or categorisations, frequency and other statistics, list of corpus occurrences and more.
Licence: The licence conditions for using the dataset. These may be commonly used licences such as those by Creative Commons or custom licences defined for the dataset. A link to the licence is provided where possible.
Access: Describes how public and restricted data can be accessed. If the dataset has both public and restricted parts, this category identifies which parts of it are public. The terminology used for this category is to be understood as follows:
  1. Public access via browsable homepages: a portal for a non-scientific audience where the data is shown, in most cases as video only, with no annotation or with subtitles only
  2. Open access: one can look directly into the data via a homepage
  3. Restricted access: access is restricted, and no detailed information on ways of access is given
  4. Restricted access with registration: one has to register to look at the data; registration is handled automatically and takes effect within a very short time
  5. Restricted access requires confirmed registration: one has to register to look at the data; registration is handled manually and may ask for information on affiliation, plans for data usage or reasons for access
  6. Restricted access requires (individual) licence agreement: a contract between the data holder or owner and oneself has to be made
Webpages: A list of relevant websites, such as those for the project, the research dataset, or portals for access by the general public.
Institution: List of the universities or other organisations by which the dataset was created.
Publications: Important bibliographic references for the resource. If an external list of publications for the resource exists, a link to it is included here.

Data Collection Task Entries

During corpus data collection, participants are guided by a series of tasks, such as retelling a story or open discussion of a given topic. The compendium provides a collection of commonly used tasks of this kind. This collection is intended to help with finding corpora that have comparable contents. Individual entries may cover broad concepts, such as open discussion, or specific materials, such as a specific story to be retold.

Each data collection task entry consists of a brief description, an information table and a series of tables detailing occurrences of the task in specific corpora. These task-corpus pairings are further subdivided by language, so multilingual corpora may be covered by multiple tables. This is followed by a list of references, where applicable, and the date when the entry was last inspected.

Information Table for Data Collection Task Entries
Stimulus: Brief description of the stimulus provided to participants.
Target: The linguistic phenomena that the task is intended to elicit.
Degree of Interaction: An estimate of whether the task usually results in a low, medium or high amount of interaction between participants. A reason for the degree may be given as a comment.
Duration: An estimate of how long the task usually lasts, based on instances observed in corpus data or published documentation.
Source: References to the material used in the task (e.g. books, films) or to scientific publications providing a definition of the task.

Table of a Corpus Occurrence of the Task
Corpus: The corpus in question. Provides a link to the corpus entry.
Language: The language used for this task in the given corpus. If the task occurs with multiple languages in a corpus, a separate table for each language is given. Provides a link to the language entry.
# recordings – open access: The number of recordings that are available in the publicly accessible part of the corpus.
# recordings – restricted access: The number of recordings that are only available in the non-public part of the corpus.
Data available: Links to the corpus recordings of this task, where available. Where possible, these links will connect only to the given task; otherwise disambiguating notes are provided to help find the task on the referenced page.
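
As an illustration of how task occurrences can help with finding comparable corpora, the sketch below indexes occurrence records by task and computes the tasks two corpora have in common. The `Occurrence` tuple and both functions are hypothetical; they assume the occurrence tables have already been transcribed by hand into Python objects, since no machine-readable export is described here.

```python
from collections import defaultdict
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One 'corpus occurrence' table: a task in one corpus for one language."""
    task: str
    corpus: str
    language: str
    open_access: int          # recordings in the public part
    restricted_access: int    # recordings only in the non-public part


def corpora_by_task(occurrences):
    """Map each task to the set of (corpus, language) pairs that contain it."""
    index = defaultdict(set)
    for o in occurrences:
        index[o.task].add((o.corpus, o.language))
    return index


def shared_tasks(occurrences, corpus_a, corpus_b):
    """Tasks that occur in both corpora, sorted by name."""
    tasks_a = {o.task for o in occurrences if o.corpus == corpus_a}
    tasks_b = {o.task for o in occurrences if o.corpus == corpus_b}
    return sorted(tasks_a & tasks_b)
```

Pairs of `(corpus, language)` are used as index values because a multilingual corpus contributes one occurrence table per language.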

Language Entries

The compendium provides an index of the languages covered by its resources. Information on languages is taken from Glottolog, Ethnologue and, in a few cases, Wikipedia.

As sign languages often go by a number of different English and local names and acronyms, we list commonly used ones, roughly sorted by which variants are preferred within the language community and by how commonly they are used locally and in research. The most preferred English name and (where applicable) most preferred acronym of each language are shown in the language index. However, each language can still be found by all its other names, acronyms and identifiers by typing them in the provided search filter.

Due to the extensive number of languages and our limited knowledge of many of them, we cannot guarantee that all of the listed names are correct or in use. Should we be missing a name or list an outdated or even discriminatory/offensive name, please contact us and we will correct the entry.

Each language entry lists a variety of common names and identifiers for the language, followed by lists of corpora and lexical resources that contain data for it.

Language Names and Identifiers
ISO 639-3: The unique identifier of the language in the ISO 639-3 code table.
Glottolog: The unique glottocode identifier of the language in the Glottolog database.
Acronyms: Language acronyms commonly used by the language community or in research publications.
English names: English names for the language.
Local names: Names for the language used in its native region. So far this is limited to languages with a written form, which unfortunately prevents the representation of sign language names in their own language. For names written in scripts other than the Latin alphabet, a transliteration is also provided.
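
The search behaviour described above, where a language can be found by any of its names, acronyms or identifiers, could be approximated as in the following Python sketch. The `Language` record and the `search` function are hypothetical and say nothing about how the compendium's own filter is implemented; the sketch assumes that names and acronyms are stored in order of preference, so the first of each is the one shown in the index.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Language:
    """Hypothetical record mirroring the names-and-identifiers table."""
    iso639_3: str
    glottocode: str
    acronyms: List[str] = field(default_factory=list)
    english_names: List[str] = field(default_factory=list)
    local_names: List[str] = field(default_factory=list)

    def display_name(self) -> str:
        # Assumption: lists are ordered by preference, most preferred first.
        name = self.english_names[0]
        return f"{name} ({self.acronyms[0]})" if self.acronyms else name

    def search_keys(self) -> List[str]:
        return [self.iso639_3, self.glottocode, *self.acronyms,
                *self.english_names, *self.local_names]


def search(languages, query):
    """Case-insensitive substring match over all names and identifiers."""
    q = query.strip().casefold()
    return [lang for lang in languages
            if any(q in key.casefold() for key in lang.search_keys())]
```

With this substring approach an empty query matches every language, which is the usual behaviour of a filter box before anything has been typed.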