About
The information provided in the compendium is compiled from public resource documentation, research articles, inspection of public data and personal correspondence with resource creators. Each compendium entry consists of a free-form text description, a structured info table and a list of references.
As we follow the terminology of each individual resource, differences in terminology may occur, such as different size indications (sign, token, type) or the use of deaf vs. Deaf. Where possible we use consistent terminology, enriched with comments if needed. All entries are interconnected, providing links between related resources, between languages and resources and between tasks and corpora. The index page of each category has a text filter field that allows you to quickly search its content by entering (partial) names and identifiers. If a link to an external page does not work or no longer seems to contain the right content, you can also click the icon after the link, which will take you to a backup copy of the page that was archived by the Internet Archive Wayback Machine.
This page outlines the curation criteria that resources must meet to be considered for inclusion, followed by descriptions of how different entry types of the compendium are structured.
Curation Criteria
The goal of the compendium is to help researchers find data that represent each language as it is used naturally by signers with L1 language proficiency. Corpora should contain (semi-)spontaneous language production rather than prepared utterances or translations of spoken language content. As such, the compendium does not cover interpreted television broadcasts or language acquisition datasets.
In selecting suitable curation criteria, we also had to take into account that there are strong imbalances between languages in the size and number of available resources. To address this we chose a two-tiered approach of minimum and strict requirements. All resources must meet the minimum requirements, but if some resources for a given language also meet the strict requirements, other resources for that language are not (yet) listed. The requirements are applied to corpora and lexical resources separately, so a language can be subject to the strict requirements for one and the minimum requirements for the other. This regulates the number of included resources for comparatively well-resourced languages without disqualifying less-resourced languages entirely.
Developing the criteria was an organic process that went hand in hand with the inspection of potential resources. They may be adjusted further as the compendium grows over time.
The curation criteria for the compendium are as follows:
General Criteria for Resources
- Must include video data
- No sign-supported systems
- No language acquisition data
- No historical sign languages
- Data must be attainable
Criteria for Corpora
- Must be (semi-)spontaneous signing
- L1 signers
- Data must have at least a partial translation and/or gloss annotation
- At least 5 hours (minimum) or 10 hours (strict) of sign language recordings. (Multilingual corpora are exempt.)
Criteria for Lexical Resources
- Must include index
- At least 100 (minimum) or 1000 (strict) different signs. Multilingual resources are exempt.
Criteria for Data Collection Tasks
- Used by multiple resources
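The two-tiered duration rule for corpora can be sketched in code. The following Python snippet is only an illustration and not the compendium's actual tooling: the `Corpus` class, its field names and the `select_corpora` function are hypothetical. It also assumes that exempt multilingual corpora are always kept and do not influence which tier applies to a language, which the criteria above do not spell out.

```python
from dataclasses import dataclass

MIN_HOURS = 5.0      # minimum requirement for corpora
STRICT_HOURS = 10.0  # strict requirement for corpora


@dataclass
class Corpus:
    """Hypothetical record for a candidate corpus."""
    name: str
    language: str
    hours: float
    multilingual: bool = False


def select_corpora(corpora):
    """Apply the two-tiered duration rule separately for each language."""
    # Assumption: multilingual corpora are exempt and always kept.
    selected = [c for c in corpora if c.multilingual]

    # Group the remaining corpora that meet the minimum requirement.
    by_language = {}
    for c in corpora:
        if not c.multilingual and c.hours >= MIN_HOURS:
            by_language.setdefault(c.language, []).append(c)

    for candidates in by_language.values():
        strict = [c for c in candidates if c.hours >= STRICT_HOURS]
        # If any corpus meets the strict tier, list only those;
        # otherwise fall back to everything meeting the minimum.
        selected.extend(strict if strict else candidates)
    return selected


candidates = [
    Corpus("A", "lang-x", 12.0),
    Corpus("B", "lang-x", 6.0),   # dropped: A meets the strict tier
    Corpus("C", "lang-y", 6.0),   # kept: no strict corpus for lang-y
    Corpus("D", "lang-y", 3.0),   # dropped: below the minimum
    Corpus("E", "lang-z", 1.0, multilingual=True),  # exempt
]
print(sorted(c.name for c in select_corpora(candidates)))
```

The same pattern would apply to lexical resources with the thresholds of 100 and 1000 signs, evaluated independently of the corpus tier.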
Corpus Entries
Each corpus entry consists of a brief description, followed by an information table. The Cite As section specifies how the resource itself should be cited according to its creators. If the corpus contains common data collection tasks, they are outlined in a series of short tables. The last elements are a list of articles mentioned in the entry and the date when the entry was last inspected.
Language | The languages used in the primary data of the corpus. Does not include languages used in annotation or translation. |
---|---|
Size | Size of the corpus. Depending on the information available, this may be specified as token count, type count, recording hours, number of video clips and/or file size. |
Participants | Demographic information about the corpus participants. Apart from the number of participants this may include which regions they are from, age groups, gender distribution, and more. It is limited to demographic information that has been publicly documented. |
Metadata Format | The file formats in which machine-readable metadata is provided by the corpus. |
Translation | Which languages the primary data is translated into and how much of it has been translated. |
Annotation | How much data has been annotated and which annotation conventions were used. If possible, a reference to the conventions is provided, otherwise information is paraphrased. |
Data Format | The file formats in which the annotation/translation data of the corpus is provided. |
Licence | The licence conditions for using the dataset. These may be commonly used licences such as those by Creative Commons or custom licences defined for the dataset. A link to the licence is provided where possible. |
Access | Describes how public and restricted data can be accessed. If the dataset has both public and restricted parts, this category identifies which parts of it are public. The terminology used for this category is to be understood as follows:
|
Webpages | A list of relevant websites, such as those for the project, the research dataset, or portals for access by the general public. |
Institution | List of the universities or other organisations by which the dataset was created. |
Publications | Important bibliographic references for the resource. If an external list of publications for the resource exists, a link to it is included here. |
Task | The data collection task in question. Provides a link to the task entry. |
---|---|
# recordings – open access | The number of recordings that are available in the publicly accessible part of the corpus. |
# recordings – restricted access | The number of recordings that are only available in the non-public part of the corpus. |
Data available | Links to the corpus recordings of this task, where available. Where possible these links will connect only to the given task; otherwise disambiguating notes are provided to help find the task on the referenced page. |
Lexical Resource Entries
The collection of lexical resources includes both lexical databases and digital dictionaries. Each lexical resource entry consists of a brief description, followed by an information table. The Cite As section specifies how the resource itself should be cited according to its creators. The last elements are a list of articles mentioned in the entry and the date when the entry was last inspected.
Languages | The languages used in the lexical resource. As most lexical resources can be used as bilingual dictionaries to some degree, this covers both signed and spoken languages. |
---|---|
Size | Number of lexical items. Items are identified as signs or types depending on the resource. |
Linguistic Information | Which linguistic information is provided for lexical items, such as ID-glosses, translational equivalents, citation form video, meanings, phonetic transcription or categorisations, frequency and other statistics, list of corpus occurrences and more. |
Licence | The licence conditions for using the dataset. These may be commonly used licences such as those by Creative Commons or custom licences defined for the dataset. A link to the licence is provided where possible. |
Access | Describes how public and restricted data can be accessed. If the dataset has both public and restricted parts, this category identifies which parts of it are public. The terminology used for this category is to be understood as follows:
|
Webpages | A list of relevant websites, such as those for the project, the research dataset, or portals for access by the general public. |
Institution | List of the universities or other organisations by which the dataset was created. |
Publications | Important bibliographic references for the resource. If an external list of publications for the resource exists, a link to it is included here. |
Data Collection Task Entries
During corpus data collection, participants are guided by a series of tasks, such as retelling a story or open discussion of a given topic. The compendium provides a collection of such commonly used tasks. This collection is intended to help with finding corpora that have comparable contents. Individual entries may cover broad concepts, such as open discussion, or specific materials, such as a specific story to be retold.
Each data collection task entry consists of a brief description, an information table and a series of tables detailing occurrences of the task in specific corpora. These task-corpus pairings are further subdivided by language, so multilingual corpora may be covered by multiple tables. This is followed by a list of references, where applicable, and the date when the entry was last inspected.
Stimulus | Brief description of the stimulus provided to participants. |
---|---|
Target | The linguistic phenomena that the task is intended to elicit. |
Degree of Interaction | An estimate of whether the task usually results in a low, medium or high amount of interaction between participants. A reason for the degree may be given as a comment. |
Duration | An estimate of how long the task usually lasts, based on instances observed in corpus data or published documentation. |
Source | References to the material used in the task (e.g. books, films) or to scientific publications providing a definition of the task. |
Corpus | The corpus in question. Provides a link to the corpus entry. |
---|---|
Language | The language used for this task in the given corpus. If the task occurs with multiple languages in a corpus, a separate table for each language is given. Provides a link to the language entry. |
# recordings – open access | The number of recordings that are available in the publicly accessible part of the corpus. |
# recordings – restricted access | The number of recordings that are only available in the non-public part of the corpus. |
Data available | Links to the corpus recordings of this task, where available. Where possible these links will connect only to the given task; otherwise disambiguating notes are provided to help find the task on the referenced page. |
Language Entries
The compendium provides an index of the languages covered by its resources. Information on languages is taken from Glottolog, Ethnologue and, in a few cases, Wikipedia.
As sign languages often go by a number of different English and local names and acronyms, we list commonly used ones, roughly sorted by which variants are preferred within the language community and by how commonly they are used locally and in research. The most preferred English name and (where applicable) most preferred acronym of each language are shown in the language index. However, each language can still be found by all its other names, acronyms and identifiers by typing them in the provided search filter.
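As a rough illustration of how such a filter can find an entry by any of its labels, the following Python sketch matches a (partial) query against every name, acronym and identifier of an entry, ignoring case. It is not the compendium's actual implementation: the function names, the entry structure and the example languages and identifiers are all hypothetical.

```python
def matches(query, labels):
    """Return True if the query is a case-insensitive substring of any label."""
    needle = query.strip().casefold()
    return any(needle in label.casefold() for label in labels)


def filter_entries(entries, query):
    """Keep the entries for which at least one label matches the query."""
    return [entry for entry in entries if matches(query, entry["labels"])]


# Hypothetical index entries; names and identifiers are invented.
entries = [
    {"name": "Examplish Sign Language",
     "labels": ["Examplish Sign Language", "XSL", "xmpl1234"]},
    {"name": "Samplese Sign Language",
     "labels": ["Samplese Sign Language", "SSL", "smpl5678"]},
]

for query in ("xsl", "5678", "sign lang"):
    print(query, [e["name"] for e in filter_entries(entries, query)])
```

Only the preferred name is displayed, but because every label is searched, a less common name or an identifier still leads to the entry.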
Due to the extensive number of languages and our limited knowledge of many of them, we cannot guarantee that all of the listed names are correct or in use. Should we be missing a name or list an outdated or even discriminatory/offensive name, please contact us and we will correct the entry.
Each language entry lists a variety of common names and identifiers for the language, followed by lists of corpora and lexical resources that contain data for it.
ISO 639-3 | The unique identifier of the language in the ISO 639-3 code table. |
---|---|
Glottolog | The unique glottocode identifier of the language in the Glottolog database. |
Acronyms | Language acronyms commonly used by the language community or in research publications. |
English names | English names for the language. |
Local names | Names for the language used in its native region. So far this is limited to languages with a written form, which unfortunately prevents the representation of sign language names in their own language. For names written in scripts other than the Latin alphabet, a transliteration is also provided. |